Abstract
Sign Language Recognition (SLR) has attracted significant research attention in recent years, particularly Continuous Sign Language Recognition (CSLR), which is considerably more complex than Isolated Sign Language Recognition (ISLR). A prominent challenge in CSLR is accurately detecting the boundaries of isolated signs within a continuous video stream. In addition, the reliance of existing models on handcrafted features limits the accuracy they can achieve. To overcome these challenges, we propose a novel Transformer-based approach that improves accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. During training on isolated sign videos, hand keypoint features extracted from the input video are enriched by the Transformer model and then passed to the final classification layer. The trained model, combined with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. Evaluation on two distinct datasets, each containing continuous signs and their corresponding isolated signs, demonstrates promising results.
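The abstract describes applying a trained isolated-sign classifier to a continuous video and using post-processing to recover sign boundaries. The exact post-processing method is not specified in the abstract; the sketch below assumes a common sliding-window scheme: the classifier scores overlapping windows of frames, low-confidence windows are treated as background, and runs of consecutive windows with the same predicted label are merged into `(start_frame, end_frame, label)` segments. All names and parameters here (`detect_sign_boundaries`, `conf_thresh`, window/stride sizes) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def detect_sign_boundaries(window_probs, stride=4, window=16, conf_thresh=0.6):
    """Hypothetical post-processing sketch: convert per-window class
    probabilities from an isolated-sign classifier into
    (start_frame, end_frame, label) segments on a continuous video.

    window_probs: (num_windows, num_classes) array, one row per
    sliding window of `window` frames taken every `stride` frames.
    """
    labels = window_probs.argmax(axis=1)
    confs = window_probs.max(axis=1)
    segments = []          # list of (label, start_window, end_window)
    cur = None
    for i, (lab, c) in enumerate(zip(labels, confs)):
        if c < conf_thresh:
            lab = -1       # low-confidence window -> background
        if cur is not None and lab == cur[0]:
            cur = (cur[0], cur[1], i)          # extend current run
        else:
            if cur is not None and cur[0] != -1:
                segments.append(cur)           # close previous sign run
            cur = (lab, i, i)
    if cur is not None and cur[0] != -1:
        segments.append(cur)
    # map window indices back to frame spans
    return [(s * stride, e * stride + window, int(lab))
            for lab, s, e in segments]

# Example: three confident windows of class 1, one ambiguous window,
# then two confident windows of class 0.
probs = np.array([[0.1, 0.9], [0.2, 0.8], [0.1, 0.9],
                  [0.5, 0.5],
                  [0.9, 0.1], [0.8, 0.2]])
print(detect_sign_boundaries(probs))  # → [(0, 24, 1), (16, 36, 0)]
```

The confidence threshold acts as the boundary detector: frames between confident runs are left unlabeled, so adjacent signs are separated even when windows overlap.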
URL
https://arxiv.org/abs/2402.14720