Abstract
The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps and the identity of the speaking character. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method on a variety of TV sitcoms, including Seinfeld, Frasier, and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast amount of video available on modern streaming services. Project page: \url{this https URL}
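The abstract's two-stage idea — collect high-precision audio exemplars per character, then label every speech segment by its nearest exemplar — might be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the embeddings, character names, and cosine-similarity metric are all assumptions for the example.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify_segment(segment_embedding, exemplars):
    """Assign a speech segment to the character whose exemplar
    embedding is most similar (nearest-exemplar classification).

    exemplars: dict mapping character name -> list of exemplar
    embeddings (hypothetical speaker embeddings, e.g. from an
    off-the-shelf speaker-verification model).
    """
    best_character, best_similarity = None, -1.0
    for character, embeddings in exemplars.items():
        for exemplar in embeddings:
            similarity = cosine_similarity(segment_embedding, exemplar)
            if similarity > best_similarity:
                best_character, best_similarity = character, similarity
    return best_character, best_similarity

# Toy usage with 2-D stand-in embeddings (real speaker embeddings
# would be high-dimensional vectors from an audio encoder).
exemplars = {
    "jerry": [[1.0, 0.0]],
    "elaine": [[0.0, 1.0]],
}
speaker, score = classify_segment([0.9, 0.1], exemplars)
print(speaker)  # → jerry
```

In practice the exemplar-selection stage would be the critical part: only segments where audio-visual cues give high confidence about the speaker would be admitted as exemplars, so that this simple nearest-exemplar step inherits their precision.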
URL
https://arxiv.org/abs/2401.12039