
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition

2025-04-10 14:27:25
Alexander Brettmann, Jakob Grävinghoff, Marlene Rüschoff, Marie Westhues

Abstract

Sign language is a fundamental means of communication for the deaf and hard-of-hearing (DHH) community, enabling nuanced expression through gestures, facial expressions, and body movements. Despite its critical role in facilitating interaction within the DHH population, significant barriers persist due to limited sign language fluency among the hearing population. Overcoming this communication gap through automatic sign language recognition (SLR) remains a challenge, particularly at the dynamic word level, where temporal and spatial dependencies must be recognized effectively. While Convolutional Neural Networks (CNNs) have shown potential in SLR, they are computationally intensive and struggle to capture global temporal dependencies within video sequences. To address these limitations, we propose a Video Vision Transformer (ViViT) model for word-level American Sign Language (ASL) recognition. Transformer models use self-attention mechanisms to capture global relationships across the spatial and temporal dimensions, which makes them well suited to complex gesture recognition tasks. The VideoMAE model achieves a Top-1 accuracy of 75.58% on the WLASL100 dataset, compared with 65.89% for a traditional CNN baseline. Our study demonstrates that transformer-based architectures have great potential to advance SLR, overcome communication barriers, and promote the inclusion of DHH individuals.
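The abstract describes fine-tuning a self-supervised VideoMAE backbone as a video classifier over the 100 glosses of WLASL100. The sketch below shows what such a setup can look like with the Hugging Face transformers library; the checkpoint name, 16-frame clip length, and dummy input are illustrative assumptions, not the authors' actual training configuration or hyperparameters.

```python
# Minimal sketch (assumptions noted): VideoMAE fine-tuned as a 100-way
# word-level ASL classifier, one class per WLASL100 gloss.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Pre-trained backbone with a freshly initialized classification head.
# The checkpoint name is an assumption, not taken from the paper.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base",
    num_labels=100,
)

# A clip is a list of 16 RGB frames (H, W, C); random data stands in for a
# real sign video here.
clip = list(np.random.randint(0, 256, size=(16, 224, 224, 3), dtype=np.uint8))
inputs = processor(clip, return_tensors="pt")    # pixel_values: (1, 16, 3, 224, 224)
labels = torch.tensor([42])                      # hypothetical gloss index

outputs = model(**inputs, labels=labels)         # cross-entropy loss + logits
print(outputs.loss.item(), outputs.logits.shape) # logits: (1, 100)
```

In a real pipeline the 16 frames would be sampled from the sign clip rather than generated randomly, and the model fine-tuned end-to-end on the WLASL100 training split.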

URL

https://arxiv.org/abs/2504.07792

PDF

https://arxiv.org/pdf/2504.07792.pdf

