Paper Reading AI Learner

I Know You're Listening: Adaptive Voice for HRI

2025-06-18 03:23:41
Paige Tuttösi

Abstract

While the use of social robots for language teaching has been explored, there remains limited work on task-specific synthesized voices for language teaching robots. Given that language learning is a verbal task, this gap may severely limit the effectiveness of robots in language teaching. We address this lack of L2 teaching robot voices through three contributions: 1. We address the need for a lightweight and expressive robot voice. Using a fine-tuned version of Matcha-TTS, we use emoji prompting to create an expressive voice that shows a range of expressivity over time. The voice can run in real time with limited compute resources. Through case studies, we found this voice more expressive, more socially appropriate, and better suited to long periods of expressive speech, such as storytelling. 2. We explore how to adapt a robot's voice to physical and social ambient environments so that our voices can be deployed in various locations. We found that increasing pitch and pitch rate in noisy, high-energy environments makes the robot's voice appear more appropriate and more aware of its current environment. 3. We create an English TTS system with improved clarity for L2 listeners using known linguistic properties of vowels that are difficult for these listeners. We used a data-driven, perception-based approach to understand how L2 speakers use duration cues to interpret challenging words containing tense (long) and lax (short) vowel minimal pairs in English. We found that vowel duration strongly influences perception for L2 listeners, and we created an "L2 clarity mode" for Matcha-TTS that lengthens tense vowels while leaving lax vowels unchanged. Our clarity mode was found to be more respectful, intelligible, and encouraging than base Matcha-TTS while reducing transcription errors on these challenging tense/lax minimal pairs.
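The core idea of the "L2 clarity mode" can be illustrated with a small sketch: given a phoneme sequence and its predicted per-phoneme durations, stretch only the tense vowels and pass lax vowels through untouched. The ARPABET vowel sets, the stress-digit handling, and the 1.3x stretch factor below are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of the tense-vowel lengthening behind an "L2 clarity
# mode": tense (long) vowels are stretched, lax (short) vowels are unchanged.
# Vowel inventories and the stretch factor are assumptions for illustration.

TENSE_VOWELS = {"IY", "EY", "UW", "OW", "AA"}  # e.g. "beat", "bait", "boot"
LAX_VOWELS = {"IH", "EH", "UH", "AH", "AE"}    # e.g. "bit", "bet", "book"

def apply_clarity_mode(phonemes, durations, stretch=1.3):
    """Return a copy of `durations` with tense-vowel durations stretched."""
    out = []
    for ph, dur in zip(phonemes, durations):
        base = ph.rstrip("012")  # drop ARPABET stress digits (e.g. IY1 -> IY)
        out.append(dur * stretch if base in TENSE_VOWELS else dur)
    return out

# Minimal pair "beat" vs "bit": only the tense vowel IY1 is lengthened.
print(apply_clarity_mode(["B", "IY1", "T"], [0.05, 0.12, 0.06]))
print(apply_clarity_mode(["B", "IH1", "T"], [0.05, 0.10, 0.06]))
```

In a real TTS pipeline such a transform would be applied to the duration predictor's output before acoustic decoding; the sketch only shows the duration-side logic.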

Abstract (translated)

While the use of social robots for language teaching has been explored, research on task-specific synthesized voices for language teaching robots remains limited. Given that language is a verbal task, this gap may severely affect the effectiveness of robots for language teaching. We address this lack of second-language (L2) teaching robot voices through three contributions: 1. We address the need for a lightweight and expressive robot voice. Using a fine-tuned version of Matcha-TTS, we use emoji prompting to create an expressive voice with a range of expressivity over time. The voice can run in real time with limited compute resources. Through case studies, we found this voice more expressive, more socially appropriate, and better suited to long stretches of expressive speech, such as storytelling. 2. We explore how to adapt a robot's voice to physical and social ambient environments so that our voices can be deployed in various locations. We found that increasing pitch and pitch variation in noisy, high-energy environments makes the robot sound more appropriate and more aware of its current environment. 3. We create an English text-to-speech (TTS) system with improved clarity for L2 learners, using known linguistic properties of vowels that are difficult for these listeners. Using a data-driven, perception-based approach, we study how L2 speakers use duration cues to interpret difficult words containing tense (long) and lax (short) vowel minimal pairs. We found that vowel duration strongly influences L2 listeners' perception, and we created an "L2 clarity mode" that lengthens tense vowels while leaving lax vowels unchanged. Compared with base Matcha-TTS, the L2 clarity mode was perceived as more respectful, intelligible, and encouraging, while reducing transcription errors on these challenging tense/lax minimal pairs. Through these three contributions, our work aims to fill this gap in robot-assisted language teaching and to provide a more effective tool for L2 education.

URL

https://arxiv.org/abs/2506.15107

PDF

https://arxiv.org/pdf/2506.15107.pdf
