Abstract
Which visual descriptors are suitable for multi-modal interaction, and how can they be integrated, via real-time video data analysis, into a corpus-based concatenative synthesis sound system?
URL
https://arxiv.org/abs/2404.10578