Paper Reading AI Learner

Face-voice Association in Multilingual Environments Challenge 2024 Evaluation Plan

2024-04-14 19:51:32
Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

Abstract

The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under a unique condition of multilingual scenario. This condition is inspired from the fact that half of the world's population is bilingual and most often people communicate under multilingual scenario. The challenge uses a dataset namely, Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines and task details for the FAME Challenge.

Abstract (translated)

技术的进步导致各种现实应用中多模态系统的使用。其中,音频-视频系统是应用最广泛的多模态系统之一。在最近几年里,由于个人脸部和声音之间独特的相关性,人们对面部和声音的关联引起了关注。多语言环境中的Face-voice协会(FAME)挑战2024专注于探讨在独特多语言场景下脸部-声音的关联。这一条件灵感来自于世界上半数人口是双语,人们通常在多语言场景下交流的事实。挑战使用了一个数据集,即多语言音频-视频(MAV-Celeb)数据集,以探索多语言环境中的脸部-声音协会。本报告提供了FAME挑战的详细信息、数据集、基线和任务细节。

URL

https://arxiv.org/abs/2404.09342

PDF

https://arxiv.org/pdf/2404.09342.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot