Abstract
Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both the hands and the face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. Although signers have expressed interest in sign language video anonymization that effectively preserves linguistic content for a variety of applications, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimation of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity of current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot, text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which convey essential linguistic information in signed languages. Combining these methods yields anonymization that better preserves the essential linguistic content of the original video. This methodology makes possible, for the first time, sign language video anonymization suitable for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.
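To make the abstract's pipeline concrete, here is a minimal sketch (not the authors' released code) of the approach it describes: per-frame, text-guided anonymization in which HED edges condition a ControlNet, so low-level image structure stands in for pose estimation. It assumes the Hugging Face `diffusers` and `controlnet_aux` packages; the checkpoint names, prompt, and `strength` value are illustrative assumptions, and the temporal-consistency and facial-expression components of DiffSLVA are omitted.

```python
def build_pipeline(device: str = "cuda"):
    """Load a Stable Diffusion img2img pipeline conditioned on HED edges
    via ControlNet (checkpoint names are illustrative assumptions)."""
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-hed", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    )
    return pipe.to(device)


def build_hed_detector():
    """HED edge extractor; `controlnet_aux` ships a pretrained detector."""
    from controlnet_aux import HEDdetector

    return HEDdetector.from_pretrained("lllyasviel/Annotators")


def anonymize_frames(frames, prompt, pipe, hed, strength=0.75):
    """Re-render each frame as a new signer described by `prompt`, while
    HED edges anchor hand shapes and facial contours from the original
    frame -- no skeletal pose estimation is involved."""
    out = []
    for frame in frames:  # frames: list of PIL.Image
        edges = hed(frame)  # low-level edge map of hands and face
        result = pipe(
            prompt=prompt, image=frame, control_image=edges, strength=strength
        ).images[0]
        out.append(result)
    return out
```

Applying this frame by frame would flicker; the paper's method additionally enforces cross-frame consistency and adds a dedicated facial-expression module, both of which this sketch leaves out.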
URL
https://arxiv.org/abs/2311.16060