Paper Reading AI Learner

CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning

2023-03-22 17:59:59
Yiting Cheng, Fangyun Wei, Jianmin Bao, Dong Chen, Wenqiang Zhang

Abstract

This work focuses on sign language retrieval-a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos, not only contain visual signals but also carry abundant semantic meanings by themselves due to the fact that sign languages are also natural languages. Considering this character, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed as cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue-sign language datasets are orders of magnitude smaller in scale than that of speech recognition. We alleviate this issue by adopting a domain-agnostic sign encoder pre-trained on large-scale sign videos into the target domain via pseudo-labeling. Our framework, termed as domain-aware sign language retrieval via Cross-lingual Contrastive learning or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on PHOENIX-2014T dataset. Code and models are available at: this https URL.

Abstract (translated)

本研究专注于 Sign Language 检索--一项最近提出的理解 sign language 的任务。 Sign Language 检索包括两个子任务:文本到Sign视频(T2V)检索和Sign视频到文本(V2T)检索。与传统的 video-text 检索不同,Sign 视频不仅包含视觉信号,而且本身携带丰富的语义含义,因为 sign 语言也是自然语言。考虑到这一特点,我们将 Sign Language 检索界定为跨语言检索问题和视频-text 检索任务。具体来说,我们考虑了 Sign 语言和自然语言的语言学特征,同时同时识别 fine-grained 跨语言映射(即 sign-to-word 映射),而在 joint embedding 空间中,同时比较文本和 Sign 视频。这一过程被称为跨语言对比学习。此外,数据稀缺问题也带来了挑战--Sign 语言数据集的规模比语音识别数据集小得多。我们通过伪标签方式将具有广泛 Sign 语言训练数据的Sign 编码器应用于目标域。我们的框架,称为跨语言对比学习的 Sign 语言检索(CiCo),在多个数据集上比先驱方法表现更好,例如,How2Sign 数据集上的 T2V 检索改进了 22.4 倍,V2T 检索改进了 28.0 倍,而 PHOENIX-2014T 数据集上的 V2T 检索改进了 13.7 倍和 17.1 倍。代码和模型可在 this https URL 中找到。

URL

https://arxiv.org/abs/2303.12793

PDF

https://arxiv.org/pdf/2303.12793.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot