Abstract
Sign language is a fundamental means of communication for the deaf and hard-of-hearing (DHH) community, enabling nuanced expression through gestures, facial expressions, and body movements. Despite its critical role in facilitating interaction within the DHH population, significant barriers persist because few hearing people are fluent in sign language. Overcoming this communication gap through automatic sign language recognition (SLR) remains challenging, particularly at the dynamic word level, where temporal and spatial dependencies must be recognized effectively. While Convolutional Neural Networks (CNNs) have shown potential for SLR, they are computationally intensive and struggle to capture global temporal dependencies across video sequences. To address these limitations, we propose a Video Vision Transformer (ViViT) model for word-level American Sign Language (ASL) recognition. Transformer models use self-attention to capture global relationships across the spatial and temporal dimensions, making them well suited to complex gesture recognition tasks. The VideoMAE model achieves a Top-1 accuracy of 75.58% on the WLASL100 dataset, compared with 65.89% for traditional CNNs. Our study demonstrates that transformer-based architectures have great potential to advance SLR, overcome communication barriers, and promote the inclusion of DHH individuals.
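As a rough illustration of the approach the abstract describes, the sketch below loads a pretrained VideoMAE backbone with a fresh 100-way classification head for word-level sign recognition. The checkpoint name ("MCG-NJU/videomae-base"), clip length, and preprocessing are assumptions for illustration and may differ from the paper's actual training pipeline.

```python
# Illustrative sketch only: a VideoMAE video classifier configured for a
# 100-gloss vocabulary (WLASL100-style). Checkpoint, frame count, and
# preprocessing are assumptions, not the paper's exact setup.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

NUM_CLASSES = 100   # WLASL100 contains 100 sign glosses
NUM_FRAMES = 16     # default clip length for VideoMAE

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base",   # self-supervised pretraining checkpoint (assumed)
    num_labels=NUM_CLASSES,    # new, randomly initialized classifier head
)

# Dummy clip standing in for one sign video: 16 RGB frames, channels-first.
video = list(np.random.randint(0, 256, (NUM_FRAMES, 3, 224, 224), dtype=np.uint8))
inputs = processor(video, return_tensors="pt")   # resize, rescale, normalize, stack

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, NUM_CLASSES)
print("predicted gloss index:", logits.argmax(-1).item())
```

In practice the classifier head would be fine-tuned on gloss-labeled sign videos; the self-attention layers of the backbone attend jointly over space-time patches, which is the property the abstract credits for capturing global temporal dependencies.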
URL
https://arxiv.org/abs/2504.07792