JNV Corpus: A Corpus of Japanese Nonverbal Vocalizations with Diverse Phrases and Emotions

2023-05-21 12:32:03
Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

Abstract

We present the JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora lack phrase or emotion diversity, which makes it difficult to analyze NVs and to support downstream tasks such as emotion recognition. We first propose a corpus-design method with two phases: (1) collecting NV phrases through crowdsourcing; (2) recording NVs by stimulating speakers with emotional scenarios. Using this method, we then collect $420$ audio clips from $4$ speakers, covering $6$ emotions. Results of comprehensive objective and subjective experiments demonstrate that the collected NVs have high emotion recognizability and authenticity, comparable to previous corpora of English NVs. Additionally, we analyze the distributions of vowel types in Japanese NVs. To the best of our knowledge, JNV is currently the largest Japanese NV corpus in terms of phrase and emotion diversity.

URL

https://arxiv.org/abs/2305.12445

PDF

https://arxiv.org/pdf/2305.12445.pdf

