Diffusion-based methods have recently made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images of the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce a face refiner and a landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks; the landmarks encode both the reference ID and the target attributes. We then design HiFi-Net to fuse the multi-face features and align them with the landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.
https://arxiv.org/abs/2512.14542
The human visual environment comprises different surfaces distributed in space. Which parts of a scene are visible at any one time is governed by the occlusion of overlapping objects. In this work, we consider "dead leaves" models, which replicate these occlusions by layering objects on top of each other when generating images. A dead leaves model is a generative model comprising distributions over object position, shape, color, and texture. An image is generated from a dead leaves model by sampling objects ("leaves") from these distributions until a stopping criterion is reached, usually when the image is fully covered or when a given number of leaves has been sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels under independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations of the computation of the posterior probability and describe the factors that determine the feasibility of applying this computation in practice. The dead leaves image model and the associated ideal observer can be used to study segmentation decisions over a limited number of pixels, providing a principled upper bound on performance to which humans and vision algorithms can be compared.
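The generative process described above can be sketched in a few lines. The version below is a deliberately minimal illustration: grayscale disks only, with uniform position, radius, and color distributions standing in for the model's object distributions, which in a real dead leaves model would be chosen distributions in their own right.

```python
import random

def sample_dead_leaves(width, height, max_leaves, seed=0):
    """Sample a 'dead leaves' image. Leaves are drawn front-to-back, so a
    pixel keeps the color of the FIRST leaf that covers it (occlusion)."""
    rng = random.Random(seed)
    # None marks a pixel not yet covered by any leaf.
    image = [[None] * width for _ in range(height)]
    labels = [[None] * width for _ in range(height)]  # which leaf owns each pixel
    uncovered = width * height
    for leaf in range(max_leaves):
        if uncovered == 0:  # stopping criterion: image fully covered
            break
        # Independent distributions for position, shape (radius), and color.
        cx, cy = rng.uniform(0, width), rng.uniform(0, height)
        r = rng.uniform(1.0, max(width, height) / 2)
        color = rng.random()
        for y in range(height):
            for x in range(width):
                if image[y][x] is None and (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2:
                    image[y][x] = color
                    labels[y][x] = leaf
                    uncovered -= 1
    return image, labels
```

The `labels` grid records the partition of pixels into leaves, which is exactly the quantity the ideal observer is asked to infer from the pixel colors.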
https://arxiv.org/abs/2512.05539
Diffusion models have emerged as a leading technique for image generation due to their ability to create high-resolution, realistic images. Despite their strong performance, diffusion models still struggle to manage image collections with significant feature differences: they often fail to capture complex features and produce conflicting results. Prior research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among the paths and high computational costs. To tackle these issues, this paper presents the Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS dedicates each diffusion path to learning a specific class of image features. By assigning each path a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among the paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism that effectively reduces the computational cost of multi-path diffusion. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and text-image alignment, and shows improved accuracy when comparing generated images to target references.
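As a rough illustration of how fuzzy memberships could softly route a latent among diffusion paths, the sketch below uses normalized Gaussian membership functions over a scalar latent feature. The abstract does not specify DFS's membership functions, so the functional form, the rule centers, and the width parameter here are all assumptions.

```python
import math

def memberships(z, centers, sigma):
    """Normalized Gaussian fuzzy memberships of a latent feature `z`
    (a scalar here, for brevity) in each rule/path. The degrees sum to 1,
    so they can act as soft routing weights over diffusion paths."""
    raw = [math.exp(-((z - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]
    total = sum(raw)
    return [m / total for m in raw]
```

A latent near a rule's center gets most of its weight routed to that rule's path, while a latent between centers is shared among paths rather than hard-assigned.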
https://arxiv.org/abs/2512.01533
Image generation can provide physicians with an imaging basis for diagnosis when predicting Alzheimer's Disease (AD). Recent research has shown that long-term AD prediction by image generation often struggles to maintain disease-related characteristics when sequential data have irregular time intervals. Since the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly in time, this research proposes a model that estimates a temporal parameter within the Normal Inverse Gamma distribution (T-NIG) to assist long-term image generation. The T-NIG model uses brain images from two time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG identifies features using coordinate neighborhoods and incorporates a time parameter into the normal inverse gamma distribution to model how features change in brain imaging sequences with varying time intervals. Additionally, T-NIG employs uncertainty estimation to reduce both the epistemic and aleatoric uncertainties that arise from insufficient temporal data. The T-NIG model demonstrates state-of-the-art performance on both short-term and long-term prediction tasks on the dataset. Experimental results indicate that T-NIG forecasts disease progression while maintaining disease-related characteristics, even under an irregular temporal data distribution.
https://arxiv.org/abs/2511.21057
Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image -- a panorama centered at the object -- using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
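The warping step relies on the vector form of Snell's Law, which the abstract invokes but does not spell out. A minimal helper for the refracted ray direction looks as follows; the incident direction and surface normal are unit 3-vectors, and `eta` is the ratio of refractive indices n1/n2 (e.g. about 1/1.5 entering glass from air).

```python
import math

def refract(d, n, eta):
    """Refract unit direction `d` through a surface with unit normal `n`
    (pointing toward the incident side) using Snell's law; `eta` = n1/n2.
    Returns the refracted unit direction, or None on total internal
    reflection (when the refracted angle would exceed 90 degrees)."""
    cos_i = -sum(di * ni for di, ni in zip(d, n))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)   # Snell: sin_t = eta * sin_i
    if sin2_t > 1.0:
        return None  # total internal reflection: no refracted ray exists
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * di + (eta * cos_i - cos_t) * ni for di, ni in zip(d, n))
```

At normal incidence the ray passes straight through regardless of `eta`, and for oblique rays the transverse component is scaled by `eta`, which is the constraint the paper's warp-and-merge step enforces on the pixels inside the object boundary.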
https://arxiv.org/abs/2511.17340
Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images capturing a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency while balancing personalized person generation and semantic-driven scene creation. Finally, experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality, tailored to human preferences. Our dataset is publicly available at this https URL.
https://arxiv.org/abs/2511.16712
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. The model is capable of omnimodal understanding as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. over 8 benchmarks), omnimodality understanding (+7% avg. over 4), and audio-visual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
https://arxiv.org/abs/2511.12609
Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
https://arxiv.org/abs/2511.06002
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
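To make the cost saving of attention over non-overlapping patches concrete, here is a minimal NumPy sketch. For brevity it omits the Q/K/V projections and multi-head structure, and the actual NPA design in ScaleDiff may differ in these details; the point is only that restricting each token to its own patch drops the score matrix from n x n to (n/p) blocks of p x p.

```python
import numpy as np

def patch_attention(x, patch):
    """Self-attention restricted to non-overlapping patches of `patch` tokens.
    x: (n, d) token features with n divisible by `patch`. Each token attends
    only within its own patch, so the cost is O(n * patch * d) rather than
    the O(n^2 * d) of full self-attention."""
    n, d = x.shape
    assert n % patch == 0, "token count must be divisible by patch size"
    xp = x.reshape(n // patch, patch, d)               # group tokens into patches
    scores = xp @ xp.transpose(0, 2, 1) / np.sqrt(d)   # (n/p, p, p) per-patch scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                 # softmax within each patch
    return (w @ xp).reshape(n, d)
```

With `patch=1` each token attends only to itself and the output equals the input, which makes the degenerate case a handy correctness check.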
https://arxiv.org/abs/2510.25818
Unlike existing methods that rely on source images as appearance references and use source speech only to drive motion, this work proposes a novel approach that extracts information directly from the speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, using a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking-face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization with a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking-face videos exclusively from a single speech input.
https://arxiv.org/abs/2510.26819
We introduce TurboPortrait3D, a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack detail, and tend to fail at fully preserving the identity of the subject. Image diffusion models, on the other hand, excel at generating high-quality images, but besides being computationally expensive, they are not grounded in 3D and thus cannot directly produce multi-view-consistent outputs. In this work, we demonstrate that image-space diffusion models can significantly enhance the quality of existing image-to-avatar methods while maintaining 3D-awareness and running with low latency. Our method takes a single frontal image of a subject as input and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model that is conditioned on the input image(s) and is specifically trained to refine the renders in a multi-view-consistent way. Moreover, we introduce a novel and effective training strategy: pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach outperforms the current state-of-the-art for portrait novel-view synthesis both qualitatively and quantitatively, while being time-efficient.
https://arxiv.org/abs/2510.23929
AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token generation mechanism remains computationally intensive, and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose the Nested AutoRegressive (NestAR) model, which nests AR architectures for image generation. NestAR arranges multi-scale modules hierarchically: the modules form an AR chain in which each larger-scale module is conditioned on the outputs of the previous smaller-scale module, and within each module another AR structure generates "patches" of tokens. The proposed nested AR architecture reduces the overall complexity of generating $n$ image tokens from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ and increases image diversity. NestAR further incorporates a flow matching loss to use continuous tokens, and develops objectives to coordinate the multi-scale modules during training. NestAR achieves competitive image generation performance while significantly lowering computational cost.
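The O(log n) claim can be made concrete with a toy scale schedule: if each module doubles the resolution per side (quadrupling the token count) of the previous one, the number of sequential modules grows logarithmically in the total token count even though the tokens produced overall remain O(n). This is an illustrative reading of the architecture, not the paper's exact schedule.

```python
def scale_schedule(n_tokens):
    """Token counts per scale when each module quadruples (2x per side)
    the token count of the previous one, capped at `n_tokens`. The length
    of the returned list, i.e. the number of sequential modules, is
    O(log n), while the sum of its entries is O(n)."""
    scales = [1]
    while scales[-1] < n_tokens:
        scales.append(min(scales[-1] * 4, n_tokens))
    return scales
```

For a 16x16 token grid (256 tokens) this yields five sequential modules instead of 256 sequential token steps.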
https://arxiv.org/abs/2510.23028
Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, in particular overlooking the difficulty of prompts with multiple different instances of the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method improves image-text alignment across a broad range of diffusion models. \footnote{Our code and data have been released in the supplementary material and will be made publicly available after the paper is accepted.}
https://arxiv.org/abs/2510.23020
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation: the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicting reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early timesteps and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
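The abstract does not give the SARS or TDW formulas, but their intent can be sketched with an assumed non-linear reward combination: a multiplicative bonus for synergistic signals, a penalty proportional to their disagreement, and timestep-dependent weights. Every coefficient and functional form below is an assumption for illustration only.

```python
def shaped_reward(r_id, r_prompt, t, T):
    """Illustrative (assumed) combination of the two Customized-GRPO ideas.
    SARS: a multiplicative bonus fires only when BOTH the identity reward
    and the prompt reward are high, and a penalty fires when they conflict
    (one high, one low). TDW: early denoising steps (large t/T) weight
    prompt-following, later steps weight identity preservation."""
    progress = t / T                 # 1.0 = start of denoising, 0.0 = end
    w_prompt = progress              # TDW: prompt-following matters early
    w_id = 1.0 - progress            # TDW: identity matters late
    base = w_prompt * r_prompt + w_id * r_id
    synergy = r_id * r_prompt        # high only when both rewards are high
    conflict = abs(r_id - r_prompt)  # large when the two signals disagree
    return base + 0.5 * synergy - 0.5 * conflict
```

Under this shaping, a sample that scores well on both objectives is rewarded more than a linear average would give it, while a sample that trades one objective off against the other is actively penalized, which is the "sharper gradient" the abstract describes.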
https://arxiv.org/abs/2510.18263
Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with average classification accuracy improvements of 0.73% and 6.5% under conventional and long-tail settings, respectively.
https://arxiv.org/abs/2510.15194
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design tailored, task-specific evaluation pipelines. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at this https URL.
https://arxiv.org/abs/2510.11026
Generative AI offers vast opportunities for creating visualisations, such as graphics, videos, and images. However, recent studies of AI-generated visualisations have primarily focused on the creation process and image quality, overlooking representational biases. This study addresses this gap by testing representational biases in AI-generated pictures in an occupational setting and evaluating how two AI image generator tools, DALL-E 3 and Ideogram, compare. Additionally, the study discusses topics such as ageing and emotions in AI-generated images. As AI image tools become more widely used, addressing and mitigating harmful gender biases is essential to ensure diverse representation in media and professional settings. In this study, more than 750 AI-generated images of occupations were produced from prompts. The thematic analysis revealed that both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in AI-generated images, although to varying degrees. These findings emphasise that AI visualisation tools risk reinforcing narrow representations. In our discussion section, we propose suggestions for practitioners, individuals, and researchers to increase representation when generating images with visible genders.
https://arxiv.org/abs/2510.08628
Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent from the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which precludes uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% for prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. GHOST also uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.
https://arxiv.org/abs/2509.25178
Neoadjuvant chemotherapy (NAC) is a common treatment option before the main surgery for breast cancer. Response to NAC is monitored using follow-up dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Accurate prediction of NAC response helps with treatment planning. Here, we use maximum intensity projection images from DCE-MRI to generate post-treatment images (i.e., 3 or 12 weeks after NAC) from pre-treatment images, leveraging the emerging diffusion model. We introduce prompt tuning to account for known clinical factors affecting response to NAC. Our model outperformed other generative models on image quality metrics and was better at generating images that reflect changes in tumor size according to pCR. An ablation study confirmed the design choices of our method. Our study has the potential to support precision medicine.
https://arxiv.org/abs/2509.24185
Designing realistic multi-object scenes requires not only generating images but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, these models lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency but struggle to capture the semantic richness of visual scenes. To bridge this gap, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images containing the target objects, our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. We then leverage compositional diffusion, a method traditionally used in robotics, to synthesize bounding boxes for the spatial layout that respect the object relations encoded in the scene graph. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout, guided by the designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism, and aesthetic alignment.
https://arxiv.org/abs/2509.22720