DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization

2023-11-27 18:26:19
Zhaoyang Xia, Carol Neidle, Dimitris N. Metaxas

Abstract

Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.
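The abstract says that low-level image features such as HED edges let the method avoid pose estimation. A minimal sketch of that idea follows. It is not the authors' code, and it substitutes a plain Sobel filter for HED (a learned edge detector) so that the example runs with NumPy alone. The function name `sobel_edge_map` and the synthetic test frame are assumptions made for illustration. The sketch shows only what kind of conditioning signal an edge map is: it keeps contours (hand shapes, facial outlines) and discards colour and texture.

```python
import numpy as np


def sobel_edge_map(frame: np.ndarray) -> np.ndarray:
    """Return an edge-magnitude map in [0, 1] for an H x W x 3 frame.

    Sobel is a hand-crafted stand-in for HED, used here for illustration only.
    """
    # Collapse colour channels: the edge map should not depend on hue.
    gray = frame.astype(np.float64).mean(axis=2)
    # Replicate border pixels so the output keeps the input's height and width.
    padded = np.pad(gray, 1, mode="edge")

    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T

    h, w = gray.shape
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    # Accumulate the 3x3 filter response one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window

    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    # Normalise to [0, 1]; a perfectly flat frame has no edges at all.
    return magnitude / peak if peak > 0 else magnitude


# Hypothetical frame: left half black, right half white.
frame = np.zeros((8, 8, 3), dtype=np.uint8)
frame[:, 4:, :] = 255
edges = sobel_edge_map(frame)
```

In a pipeline of the kind the abstract describes, a per-frame map like `edges` (produced by HED rather than Sobel) would be the conditioning image given to ControlNet, while the text prompt describes the appearance of the replacement signer.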

URL

https://arxiv.org/abs/2311.16060

PDF

https://arxiv.org/pdf/2311.16060.pdf

