Abstract
Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
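The abstract does not include code; the following is a minimal structural sketch of the per-scene loop it describes. The model calls (`encode_frame`, `generate_dialogue`, `synthesize_speech`) are hypothetical stand-ins for the pretrained vision-language encoder, large language model, and TTS components the paper assumes. Only the Recursive Narrative Bank bookkeeping, which conditions each scene's dialogue on the accumulated history of prior scenes, is shown concretely.

```python
# Structural sketch of the voiced-narrative pipeline, assuming hypothetical
# stand-ins for the paper's pretrained components (no code was released).
from dataclasses import dataclass, field

@dataclass
class Scene:
    setting_prompt: str   # first prompt: defines the scene's setting
    behavior_prompt: str  # second prompt: specifies the character's behavior
    frame: bytes          # representative frame from the generated video scene

@dataclass
class NarrativeBank:
    """Accumulates dialogue history so each generation sees all prior scenes."""
    history: list[str] = field(default_factory=list)

    def context(self) -> str:
        return "\n".join(self.history)

    def append(self, utterance: str) -> None:
        self.history.append(utterance)

def encode_frame(frame: bytes) -> str:
    """Hypothetical: a pretrained vision-language encoder summarizing the frame."""
    return "<visual-context-feature>"

def generate_dialogue(setting: str, behavior: str, visual: str, history: str) -> str:
    """Hypothetical: an LLM prompted with both prompts, the visual feature,
    and the accumulated dialogue history from the narrative bank."""
    return f"[utterance grounded in: {setting} | {behavior}]"

def synthesize_speech(utterance: str) -> bytes:
    """Hypothetical: an expressive, character-consistent TTS renderer."""
    return b"<waveform>"

def voice_story(scenes: list[Scene]) -> list[bytes]:
    bank = NarrativeBank()
    audio = []
    for scene in scenes:
        visual = encode_frame(scene.frame)
        line = generate_dialogue(scene.setting_prompt, scene.behavior_prompt,
                                 visual, bank.context())
        bank.append(line)  # recursive conditioning: later scenes see this line
        audio.append(synthesize_speech(line))
    return audio
```

Note the design point this sketch makes explicit: because the bank is appended to inside the loop, no scene's dialogue is generated in isolation, which is what lets a character's utterances reflect goals established earlier in the story.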
URL
https://arxiv.org/abs/2505.16819