Abstract
Current sign language machine translation systems combine recognition of hand movements, facial expressions, and body postures with natural language processing to convert signs into text. Recent approaches use Transformer architectures to model long-range dependencies via positional encoding. However, they are less accurate at recognizing fine-grained, short-range temporal dependencies between gestures captured at high frame rates, and their high computational complexity makes training inefficient. To mitigate these issues, we propose an Adaptive Transformer (ADAT), which incorporates components for enhanced feature extraction and adaptive feature weighting through a gating mechanism to emphasize contextually relevant features while reducing training overhead and maintaining translation accuracy. To evaluate ADAT, we introduce MedASL, the first public medical American Sign Language dataset. In sign-to-gloss-to-text experiments, ADAT outperforms the encoder-decoder Transformer, improving BLEU-4 accuracy by 0.1% while reducing training time by 14.33% on PHOENIX14T and 3.24% on MedASL. In sign-to-text experiments, it improves accuracy by 8.7% and reduces training time by 2.8% on PHOENIX14T, and achieves 4.7% higher accuracy and 7.17% faster training on MedASL. Compared to encoder-only and decoder-only baselines in sign-to-text, ADAT is at least 6.8% more accurate despite being up to 12.1% slower due to its dual-stream structure.
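The abstract does not specify the form of the gating mechanism, so the following is only a generic illustration of adaptive feature weighting, not ADAT's actual layer. It is a minimal sketch assuming an element-wise sigmoid gate that blends two feature streams; the function name `gated_fusion` and the parameters `w_a`, `w_b`, and `bias` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(a, b, w_a, w_b, bias):
    """Blend two equal-length feature vectors with an element-wise gate.

    Hypothetical formulation (an assumption, not taken from the paper):
        g_i   = sigmoid(w_a[i] * a[i] + w_b[i] * b[i] + bias[i])
        out_i = g_i * a[i] + (1 - g_i) * b[i]
    A gate near 1 favours stream `a`; a gate near 0 favours stream `b`.
    """
    if not (len(a) == len(b) == len(w_a) == len(w_b) == len(bias)):
        raise ValueError("all inputs must have the same length")
    out = []
    for x, y, wx, wy, c in zip(a, b, w_a, w_b, bias):
        g = sigmoid(wx * x + wy * y + c)      # per-feature gate value
        out.append(g * x + (1.0 - g) * y)     # convex combination of streams
    return out


# With zero weights and bias the gate is 0.5, so the output is the plain average.
fused = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained model the gate parameters would be learned, so the blend could shift toward whichever stream is more informative for a given feature; here they are fixed by hand purely for illustration.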
URL
https://arxiv.org/abs/2504.11942