Abstract
Sign languages are dynamic visual languages that combine hand gestures with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make them challenging to study in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. Such illustrations are usually drawn by an artist and are therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign language footage, especially to hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers and fusing geometric information from the image and its edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
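The attention intervention described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how queries are fused or how the start and end attention weights are combined, so the weighted sum controlled by `alpha`, the joint softmax over concatenated keys, and all function and variable names below are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def style_injected_attention(q_image, q_edges, k_style, v_style, alpha=0.5):
    """Attention in which style supplies keys/values and geometry supplies queries.

    q_image, q_edges: (n, d) query features from the frame and its edge map.
    k_style, v_style: (m, d) key/value features from the style reference.
    alpha: hypothetical fusion weight (the fusion rule is an assumption).
    """
    # Assumed fusion: a simple convex combination of the two query sources.
    q = alpha * q_image + (1.0 - alpha) * q_edges
    d = q.shape[-1]
    weights = softmax(q @ k_style.T / np.sqrt(d), axis=-1)  # (n, m)
    return weights @ v_style, weights


def soft_combine(q, k_start, v_start, k_end, v_end):
    """Assumed soft combination of the start and end illustrations.

    Each query attends jointly over the keys of both illustrations, so the
    softmax decides, per position, how much each illustration contributes.
    """
    d = q.shape[-1]
    logits = np.concatenate([q @ k_start.T, q @ k_end.T], axis=1) / np.sqrt(d)
    weights = softmax(logits, axis=-1)                    # (n, m_start + m_end)
    values = np.concatenate([v_start, v_end], axis=0)     # (m_start + m_end, d)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_img, q_edge = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
    k_sty, v_sty = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
    out, w = style_injected_attention(q_img, q_edge, k_sty, v_sty)
    print(out.shape, w.shape)
```

In this sketch every output row is a convex combination of style values, which is why the stylized result stays within the range of the style features while its spatial layout is driven by the geometry-derived queries.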
URL
https://arxiv.org/abs/2504.10822