Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
https://arxiv.org/abs/2602.07449
MRI cross-modal synthesis involves generating images from one acquisition protocol using another, offering considerable clinical value by reducing scan time while maintaining diagnostic information. This paper presents a comprehensive comparison of three state-of-the-art generative models for T1-to-T2 MRI reconstruction: Pix2Pix GAN, CycleGAN, and Variational Autoencoder (VAE). Using the BraTS 2020 dataset (11,439 training and 2,000 testing slices), we evaluate these models based on established metrics including Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM). Our experiments demonstrate that all models can successfully synthesize T2 images from T1 inputs, with CycleGAN achieving the highest PSNR (32.28 dB) and SSIM (0.9008), while Pix2Pix GAN provides the lowest MSE (0.005846). The VAE, though showing lower quantitative performance (MSE: 0.006949, PSNR: 24.95 dB, SSIM: 0.6573), offers advantages in latent space representation and sampling capabilities. This comparative study provides valuable insights for researchers and clinicians selecting appropriate generative models for MRI synthesis applications based on their specific requirements and data constraints.
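The three metrics reported above can be made concrete with a short sketch. This is illustrative only, not the paper's evaluation code: the SSIM here is the global, single-window form, a simplification of the windowed SSIM used by standard implementations, and images are assumed scaled to [0, 1].

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images in [0, 1]."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (single-window) SSIM; c1, c2 are the usual stabilizers."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                 ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2)))
```

For identical images PSNR is infinite and SSIM is exactly 1, which is a quick sanity check when wiring up an evaluation loop.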
https://arxiv.org/abs/2602.07068
Multi-organ segmentation is a widely applied clinical routine, and automated organ segmentation tools dramatically streamline radiologists' workflows. Recently, deep learning (DL) based segmentation models have shown the capacity to accomplish this task. However, training segmentation networks requires large amounts of manually annotated data, a major concern given the scarcity of clinical data. Working with limited data also remains common in research on novel imaging modalities. To enhance the effectiveness of DL models trained on limited data, data augmentation (DA) is a crucial regularization technique. Traditional DA (TDA) strategies focus on basic intra-image operations, i.e., generating images with different orientations and intensity distributions. In contrast, inter-image and object-level DA operations can create new images from separate individuals. However, such DA strategies are not well explored for multi-organ segmentation. In this paper, we investigate four inter-image DA strategies: CutMix, CarveMix, ObjectAug, and AnatoMix, on two organ segmentation datasets. The results show that CutMix, CarveMix, and AnatoMix improve the average Dice score by 4.9, 2.0, and 1.9, respectively, compared with the state-of-the-art nnUNet without DA strategies. These results can be further improved by adding TDA strategies. Our experiments reveal that CutMix is a simple but robust DA strategy for driving up multi-organ segmentation performance, even though CutMix produces intuitively 'wrong' images. Our implementation is publicly available for future benchmarks.
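To make the inter-image idea concrete, here is a minimal CutMix-style sketch for a segmentation pair. The Beta(1, 1) mixing ratio and box sampling follow the common CutMix recipe; `cutmix_pair` is an illustrative helper, not the paper's implementation. The key point for segmentation is that the label map is cut and pasted with the same box, so image and label stay pixel-aligned even when the result looks anatomically "wrong".

```python
import numpy as np

def cutmix_pair(img_a, lbl_a, img_b, lbl_b, rng):
    """Paste a random rectangle from (img_b, lbl_b) into (img_a, lbl_a)."""
    h, w = img_a.shape[-2:]
    lam = rng.beta(1.0, 1.0)                       # mixing ratio ~ Beta(1, 1)
    cut_h = int(h * np.sqrt(1.0 - lam))            # box area ~ (1 - lam)
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(h), rng.integers(w)      # random box center
    y0, y1 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x0, x1 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    img, lbl = img_a.copy(), lbl_a.copy()
    img[..., y0:y1, x0:x1] = img_b[..., y0:y1, x0:x1]
    lbl[..., y0:y1, x0:x1] = lbl_b[..., y0:y1, x0:x1]   # same box for labels
    return img, lbl
```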
https://arxiv.org/abs/2602.03555
Benefiting from the significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has also made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability, or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces highly consistent results in a single inference stage. This method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that extracts comprehensively fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains the reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. We then use an ID-preserving module to inject these into the UNet. Experimental results demonstrate that our approach performs remarkably well in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.
https://arxiv.org/abs/2602.00627
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high- and low-resolution (i.e., "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen-generated images with latent human scene representations, we conducted a same-different behavioral experiment in which participants gave a "same" or "different" response when comparing the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is thus a powerful tool for studying scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While MetamerGen can generate metamers even when conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.
https://arxiv.org/abs/2601.11675
Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.
https://arxiv.org/abs/2512.14542
The human visual environment comprises different surfaces that are distributed in space. The parts of a scene that are visible at any one time are governed by the occlusion of overlapping objects. In this work we consider "dead leaves" models, which replicate these occlusions when generating images by layering objects on top of each other. A dead leaves model is a generative model consisting of distributions for object position, shape, color and texture. An image is generated from a dead leaves model by sampling objects ("leaves") from these distributions until a stopping criterion is reached, usually when the image is fully covered or when a given number of leaves has been sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels based on independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations for the computation of the posterior probability and describe the factors that determine the feasibility of practically applying this computation. The dead leaves image model and the associated ideal observer can be applied to study segmentation decisions over a limited number of pixels, providing a principled upper bound on performance against which humans and vision algorithms can be compared.
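A minimal dead-leaves sampler along the lines described above might look as follows. The disk-shaped leaves and uniform distributions are illustrative choices (any position/shape/color distributions would do), and leaves are sampled front-to-back: each new leaf is painted only onto still-uncovered pixels, which reproduces the occlusion structure of overlapping objects.

```python
import numpy as np

def sample_dead_leaves(size=64, max_leaves=500, rng=None):
    """Sample one image from a minimal dead-leaves model with disk leaves."""
    if rng is None:
        rng = np.random.default_rng()
    img = np.full((size, size), np.nan)        # NaN marks uncovered pixels
    yy, xx = np.mgrid[:size, :size]
    for _ in range(max_leaves):
        cx, cy = rng.uniform(0, size, 2)       # leaf position ~ uniform
        r = rng.uniform(2, size / 4)           # leaf radius ~ uniform
        color = rng.uniform(0, 1)              # leaf gray level ~ uniform
        disk = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
        img[disk & np.isnan(img)] = color      # occlusion: front leaves win
        if not np.isnan(img).any():            # stop once fully covered
            break
    return np.nan_to_num(img)                  # fill any remaining gaps with 0
```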
https://arxiv.org/abs/2512.05539
Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle in managing image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS uses multiple diffusion paths, each dedicated to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.
https://arxiv.org/abs/2512.01533
Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer's Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.
https://arxiv.org/abs/2511.21057
Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image -- a panorama centered at the object -- using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
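The per-pixel warping described above relies on the vector form of Snell's Law. A minimal sketch is below; `refract` is an illustrative helper (not the paper's code) computing the refracted ray direction, including the total-internal-reflection case.

```python
import numpy as np

def refract(d, n, eta):
    """Refracted direction via the vector form of Snell's Law.

    d   : unit incident direction (pointing toward the surface)
    n   : unit surface normal (pointing against d)
    eta : ratio n1 / n2 of refractive indices
    Returns None on total internal reflection.
    """
    cos_i = -np.dot(n, d)
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)     # Snell: sin_t = eta * sin_i
    if sin2_t > 1.0:
        return None                            # total internal reflection
    t = eta * d + (eta * cos_i - np.sqrt(1.0 - sin2_t)) * n
    return t / np.linalg.norm(t)
```

At normal incidence the ray passes straight through, and the tangential component of the output satisfies sin_t = eta * sin_i, which is the constraint the generation trajectory enforces between pixels inside and outside the object boundary.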
https://arxiv.org/abs/2511.17340
Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and simultaneously balances personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at this https URL.
https://arxiv.org/abs/2511.16712
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
https://arxiv.org/abs/2511.12609
Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
https://arxiv.org/abs/2511.06002
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
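To illustrate the general idea behind attention over non-overlapping patches (the actual NPA mechanism may differ), here is a minimal numpy sketch without learned Q/K/V projections: tokens are grouped into patch windows and softmax attention is computed independently inside each window, so cost scales with the patch size rather than the full token count.

```python
import numpy as np

def patch_self_attention(x, patch=4):
    """Self-attention restricted to non-overlapping patches.

    x : (H, W, C) feature map; H and W must be divisible by `patch`.
    Projections are omitted for brevity, so Q = K = V = the tokens.
    """
    h, w, c = x.shape
    # Group tokens by patch: (H/p, W/p, p*p, C)
    t = x.reshape(h // patch, patch, w // patch, patch, c)
    t = t.transpose(0, 2, 1, 3, 4).reshape(h // patch, w // patch, patch * patch, c)
    scores = t @ t.transpose(0, 1, 3, 2) / np.sqrt(c)       # per-patch QK^T
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # softmax within patch
    out = weights @ t                                       # attend within patch only
    # Scatter tokens back to the (H, W, C) layout
    out = out.reshape(h // patch, w // patch, patch, patch, c)
    return out.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
```

Because each window attends only to its own p*p tokens, the attention matrices are (p*p, p*p) instead of (HW, HW), which is the computational redundancy reduction the abstract refers to.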
https://arxiv.org/abs/2510.25818
Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
https://arxiv.org/abs/2510.26819
We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack detail, and tend to fail at fully preserving the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, are not grounded in 3D and thus are not directly capable of producing multi-view consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D-awareness and running with low latency. Our method takes a single frontal image of a subject as input, and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model which is conditioned on the input image(s), and is specifically trained to refine the renders in a multi-view consistent way. Moreover, we introduce a novel and effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms the current state-of-the-art for portrait novel-view synthesis, while being time-efficient.
https://arxiv.org/abs/2510.23929
AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive, and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose a Nested AutoRegressive (NestAR) model, which employs nested autoregressive architectures for image generation. NestAR arranges multi-scale modules in a hierarchical order. These modules at different scales are constructed in an AR architecture, where each larger-scale module is conditioned on the outputs of the preceding smaller-scale module. Within each module, NestAR uses another AR structure to generate "patches" of tokens. The proposed nested AR architecture reduces the overall complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ in generating $n$ image tokens, and also increases sample diversity. NestAR further incorporates a flow matching loss to use continuous tokens, and develops objectives to coordinate these multi-scale modules during training. NestAR achieves competitive image generation performance while significantly lowering computational cost.
https://arxiv.org/abs/2510.23028
Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. Our code and data have been released in the supplementary material and will be made publicly available after the paper is accepted.
https://arxiv.org/abs/2510.23020
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early steps and identity preservation in the later steps. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
https://arxiv.org/abs/2510.18263
Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with average classification accuracy improvements of 0.73% and 6.5% under conventional and long-tail settings, respectively.
https://arxiv.org/abs/2510.15194