Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach in which texts' surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores. This isolates the effect of semantic information, and we find that models perform much as they do on the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We find that image-based transformations add substantial noise, reducing classification accuracy. Our methodology provides a novel way of examining which features influence model predictions and allows the removal of possible spurious correlations. We find that, using semantic information alone, language-model-based classifiers can still detect AD. This work shows that difficult-to-detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration and opening new pathways for early detection systems.
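As a rough illustration of the verification step described above (not the authors' exact pipeline), one could score a transformed transcript against its original with BLEU, chrF, and a sentence-embedding cosine similarity; the sacrebleu and sentence-transformers packages and the embedding model named below are assumptions.

```python
# Sketch: check that a transformed transcript changes surface form but keeps semantics.
# Assumes the sacrebleu and sentence-transformers packages; the embedding model is illustrative.
from sacrebleu.metrics import BLEU, CHRF
from sentence_transformers import SentenceTransformer, util

original = "The boy is reaching for the cookie jar while the stool tips over."
transformed = "As the stool starts to topple, a child stretches toward the jar of cookies."

bleu = BLEU(effective_order=True).sentence_score(transformed, [original]).score
chrf = CHRF().sentence_score(transformed, [original]).score

encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode([original, transformed], convert_to_tensor=True)
semantic_sim = util.cos_sim(emb[0], emb[1]).item()

# Low BLEU/chrF with high cosine similarity indicates the surface form changed
# while the underlying meaning was preserved.
print(f"BLEU={bleu:.1f}  chrF={chrf:.1f}  semantic similarity={semantic_sim:.2f}")
```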
https://arxiv.org/abs/2512.13685
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architectural design and weaken the simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
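A minimal sketch of the joint-attention idea, not JoVA's actual implementation: video and audio tokens are concatenated into one sequence so a single self-attention layer mixes both modalities without a dedicated fusion module (dimensions and token counts are illustrative).

```python
# Sketch: joint self-attention over concatenated video and audio tokens (illustrative, not JoVA's code).
import torch
import torch.nn as nn

class JointAVBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_tokens):
        # One sequence: every video token can attend to every audio token and vice versa.
        x = torch.cat([video_tokens, audio_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        n_v = video_tokens.shape[1]
        return x[:, :n_v], x[:, n_v:]          # split back into the two token streams

video = torch.randn(2, 196, 512)   # e.g. patch tokens of a frame chunk
audio = torch.randn(2, 50, 512)    # e.g. audio codec tokens
v_out, a_out = JointAVBlock()(video, audio)
```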
https://arxiv.org/abs/2512.13677
Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify the writers of written texts. In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation, and another, normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic-control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
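A toy sketch of the kind of character-, word-, and sentence-level stylometric comparison such a method performs; the specific features and the distance-based score below are illustrative assumptions, not the StyloSpeaker feature set.

```python
# Sketch: compare two transcripts with simple stylometric features (illustrative, not the StyloSpeaker set).
import re
import numpy as np

FILLERS = {"uh", "um", "like", "well", "so"}

def stylometric_vector(text: str) -> np.ndarray:
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_chars = max(len(words), 1), max(len(text), 1)
    return np.array([
        sum(len(w) for w in words) / n_words,            # mean word length
        len(set(words)) / n_words,                       # type-token ratio
        n_words / max(len(sentences), 1),                # mean sentence length in words
        sum(text.count(c) for c in ",;:") / n_chars,     # punctuation rate
        sum(w in FILLERS for w in words) / n_words,      # filler-word rate
    ])

def style_distance(t1: str, t2: str) -> float:
    """Smaller values suggest the same speaker; the decision threshold must be calibrated on data."""
    v1, v2 = stylometric_vector(t1), stylometric_vector(t2)
    return float(np.linalg.norm(v1 - v2) / (np.linalg.norm(v1) + np.linalg.norm(v2) + 1e-9))

print(style_distance("yeah well I mean we could go there", "well yeah I mean it was like that"))
```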
https://arxiv.org/abs/2512.13667
Recent codec-based language models~(LMs) have revolutionized text-to-speech~(TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, contains two core stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at this https URL, and the code and weights will be released at the same link.
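A schematic reading of the two-stage codec, not the DisCodec implementation: parallel encoders factor the input into content, prosody, and timbre, and content and prosody are fused into the tokens an LM would predict (layer choices and dimensions are assumptions).

```python
# Sketch: tri-factor encoders and content-prosody fusion (schematic, not DisCodec's architecture).
import torch
import torch.nn as nn

class TriFactorEncoder(nn.Module):
    def __init__(self, in_dim=80, dim=256):
        super().__init__()
        def make_enc():
            return nn.Sequential(nn.Conv1d(in_dim, dim, 5, padding=2), nn.GELU(),
                                 nn.Conv1d(dim, dim, 5, padding=2))
        self.content_enc, self.prosody_enc, self.timbre_enc = make_enc(), make_enc(), make_enc()
        self.fuse = nn.Conv1d(2 * dim, dim, 1)     # content + prosody -> unified tokens for the LM

    def forward(self, mel):                        # mel: (batch, 80, frames)
        content = self.content_enc(mel)
        prosody = self.prosody_enc(mel)
        timbre = self.timbre_enc(mel).mean(dim=-1) # global timbre vector handed to the decoder
        cp_tokens = self.fuse(torch.cat([content, prosody], dim=1))
        return cp_tokens, timbre

mel = torch.randn(2, 80, 200)
cp_tokens, timbre_vec = TriFactorEncoder()(mel)
```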
https://arxiv.org/abs/2512.13251
This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.
https://arxiv.org/abs/2512.13247
Generating 3D body movements from speech shows great potential for a broad range of downstream applications, yet it still faces challenges in imitating realistic human movements. Predominant research efforts focus on end-to-end schemes for generating co-speech gestures, spanning GANs, VQ-VAEs, and recent diffusion models. Because the task is ill-posed, we argue in this paper that these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e., head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship through two explicit technical insights: i) to disentangle the complicated gesture movements, we first explore gesture motion phase manifolds with periodic autoencoders to imitate human nature from realistic distributions, while incorporating non-periodic components from current latent states for instance-level diversity; and ii) to model the hierarchical relationship among face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate our proposed approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be publicly available.
https://arxiv.org/abs/2512.13131
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: this https URL.
https://arxiv.org/abs/2512.13012
Recent advances in self-supervised learning (SSL) on Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches have underutilized the multi-layered nature of SSL encoders. To address this limitation, we propose the layer-aware time-delay neural network (L-TDNN), which directly performs layer- and frame-wise processing on the layer-wise hidden state outputs of pre-trained models, extracting fixed-size speaker vectors. L-TDNN comprises a layer-aware convolutional network, frame-adaptive layer aggregation, and attentive statistics pooling, explicitly modeling the previously overlooked layer dimension. We evaluated L-TDNN across multiple speech SSL Transformers and diverse speech-speaker corpora against other approaches for leveraging pre-trained encoders. L-TDNN consistently demonstrated robust verification performance, achieving the lowest error rates throughout the experiments. Concurrently, it stood out in terms of model compactness and exhibited inference efficiency comparable to existing systems. These results highlight the advantages of the proposed layer-aware processing approach. Future work includes exploring joint training with SSL frontends and incorporating score calibration to further enhance state-of-the-art verification performance.
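The layer-aware idea can be sketched as a learned, frame-wise weighting over the stacked hidden states of a pre-trained SSL encoder, followed by attentive statistics pooling; this is a simplified reading of the abstract, not the released L-TDNN.

```python
# Sketch: frame-adaptive layer aggregation + attentive statistics pooling (simplified, not the official L-TDNN).
import torch
import torch.nn as nn

class LayerAwarePooling(nn.Module):
    def __init__(self, dim=768, emb_dim=192):
        super().__init__()
        self.layer_scorer = nn.Linear(dim, 1)              # scores each (layer, frame) hidden state
        self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.proj = nn.Linear(2 * dim, emb_dim)

    def forward(self, hidden_states):                      # (batch, layers, frames, dim)
        layer_w = torch.softmax(self.layer_scorer(hidden_states), dim=1)
        frames = (layer_w * hidden_states).sum(dim=1)      # frame-adaptive mix over layers -> (B, T, D)
        attn_w = torch.softmax(self.attn(frames), dim=1)   # attentive weights over frames
        mean = (attn_w * frames).sum(dim=1)
        var = (attn_w * (frames - mean.unsqueeze(1)) ** 2).sum(dim=1)
        return self.proj(torch.cat([mean, var.clamp_min(1e-6).sqrt()], dim=-1))   # fixed-size speaker vector

states = torch.randn(4, 13, 200, 768)                      # e.g. all layer outputs of a WavLM-base encoder
speaker_vec = LayerAwarePooling()(states)                  # shape (4, 192)
```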
https://arxiv.org/abs/2409.07770
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
https://arxiv.org/abs/2512.12772
Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between cloud-based solutions, which offer stronger language understanding capabilities at the cost of latency, connectivity dependence, and privacy concerns, and edge-based solutions, which provide low latency and improved privacy but are limited by computational constraints. This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. ASTA integrates on-device automatic speech recognition and lightweight offline language-model inference with cloud-based LLM processing, guided by real-time system metrics such as CPU workload, device temperature, and network latency. A metric-aware routing mechanism selects the inference path at runtime, while a rule-based command validation and repair component ensures successful end-to-end command execution. We implemented our solution on an NVIDIA Jetson-based edge platform and evaluated it using a diverse dataset of 80 spoken commands. Experimental results show that ASTA successfully routes all input commands for execution, achieving a balanced distribution between online and offline inference. The system attains an ASR accuracy of 62.5% and generates executable commands without repair for only 47.5% of inputs, highlighting the importance of the repair mechanism in improving robustness. These results suggest that adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems.
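The routing decision can be illustrated with a few lines of threshold logic over live system metrics; the metric names and thresholds below are hypothetical, not the values ASTA uses.

```python
# Sketch: metric-aware edge/cloud routing for a voice command (hypothetical thresholds).
from dataclasses import dataclass
from typing import Optional

@dataclass
class SystemMetrics:
    cpu_load: float                      # 0.0 - 1.0
    temperature_c: float                 # device temperature in Celsius
    network_latency_ms: Optional[float]  # None when the device is offline

def choose_inference_path(m: SystemMetrics) -> str:
    # Offline or slow network: only the on-device model is usable.
    if m.network_latency_ms is None or m.network_latency_ms > 300:
        return "edge"
    # Device under thermal or compute pressure: offload to the cloud LLM.
    if m.cpu_load > 0.85 or m.temperature_c > 70:
        return "cloud"
    # Otherwise prefer the edge path for latency and privacy.
    return "edge"

print(choose_inference_path(SystemMetrics(cpu_load=0.4, temperature_c=55, network_latency_ms=80)))
```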
https://arxiv.org/abs/2512.12769
Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis, including gesture-focused diffusion methods, and yielding highly realistic, object-aware full-body motions with enhanced flexibility and control.
https://arxiv.org/abs/2512.12664
The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K-pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving 93.81% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
https://arxiv.org/abs/2512.12537
This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian language support. To preserve the existing capabilities of the model (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model, and train it as an extension of the textual embedding matrix of the text encoder. For simplicity, we rely on the ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a "soft" letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce natural-sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness, and (c) Romanian-English code-switching. The results indicate that our approach maintains voice cloning capabilities and enables, to a certain extent, code-switching within the same utterance; however, residual English accent characteristics remain. We open-source our code and provide example audio samples at this https URL.
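One way to read the adapter design, a frozen base embedding table with a small trainable extension for the new Romanian characters, is sketched below; dimensions and the vocabulary split are assumptions, not the released code.

```python
# Sketch: extend a frozen text-embedding matrix with trainable rows for new characters (illustrative).
import torch
import torch.nn as nn

class ExtendedCharEmbedding(nn.Module):
    def __init__(self, frozen_embed: nn.Embedding, num_new_chars: int):
        super().__init__()
        self.base = frozen_embed
        for p in self.base.parameters():
            p.requires_grad = False                      # original model weights stay untouched
        self.base_size = frozen_embed.num_embeddings
        self.extra = nn.Embedding(num_new_chars, frozen_embed.embedding_dim)   # trainable extension

    def forward(self, ids):                              # ids may index old or new characters
        is_new = ids >= self.base_size
        safe_old = torch.where(is_new, torch.zeros_like(ids), ids)
        safe_new = torch.where(is_new, ids - self.base_size, torch.zeros_like(ids))
        return torch.where(is_new.unsqueeze(-1), self.extra(safe_new), self.base(safe_old))

base = nn.Embedding(2500, 512)                           # stands in for the pre-trained text embedding
emb = ExtendedCharEmbedding(base, num_new_chars=12)      # e.g. rows for ă, â, î, ș, ț and variants
tokens = torch.tensor([[10, 2503, 7, 2509]])
print(emb(tokens).shape)                                 # torch.Size([1, 4, 512])
```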
https://arxiv.org/abs/2512.12297
Generative models are a popular choice for adult-to-adult voice conversion (VC) because of their efficient way of modelling unlabelled data. To this point, their usefulness in producing children's speech, and in particular in adult-to-child VC, has not been investigated. For adult-to-child VC, four generative models are compared: a diffusion model, a flow-based model, variational autoencoders, and a generative adversarial network. Results show that although the converted speech outputs produced by those models appear plausible, they exhibit insufficient similarity with the target speaker characteristics. We introduce an efficient frequency-warping technique that can be applied to the output of the models and significantly reduces the mismatch between adult and child speech. The outputs of all the models are evaluated using both objective and subjective measures. In particular, we compare specific speaker pairings using a unique corpus collected for dubbing children's speech.
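The frequency-warping post-processing could look roughly like the following: the frequency axis of the converted spectrogram is re-read at compressed positions so spectral content shifts upward toward a child-like range. The linear warp and the warping factor are illustrative assumptions, not the paper's exact technique.

```python
# Sketch: simple linear frequency warping of a magnitude spectrogram (illustrative, not the paper's warp).
import numpy as np

def warp_spectrogram(spec: np.ndarray, alpha: float = 1.15) -> np.ndarray:
    """spec: (freq_bins, frames). alpha > 1 shifts spectral content upward (adult -> child direction)."""
    n_bins = spec.shape[0]
    target_bins = np.arange(n_bins)
    source_bins = target_bins / alpha                 # where each output bin reads from
    warped = np.empty_like(spec)
    for t in range(spec.shape[1]):
        warped[:, t] = np.interp(source_bins, target_bins, spec[:, t])
    return warped

spec = np.abs(np.random.randn(513, 100))              # stand-in for an STFT magnitude
child_like = warp_spectrogram(spec, alpha=1.15)
```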
https://arxiv.org/abs/2512.12129
While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.
https://arxiv.org/abs/2512.11724
Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit's coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at this https URL.
https://arxiv.org/abs/2512.11321
Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select different feature sets, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a uniquely human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features have implicit correlations with emotion. For instance, speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements in accuracy, up to approximately 6% and 2% respectively, and in equal error rate (EER), with reductions of up to about 4% and 1% respectively, while achieving comparable results on ASVspoof2019. This approach provides a unified training strategy for all features and an interpretable feature direction for deepfake detection, while improving model performance through emotion-informed learning.
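One plausible instantiation of "emotion as a bridge" is a multi-task objective in which the detector shares a backbone with an auxiliary emotion head, so the learned representation keeps emotion-relevant structure; the head design and loss weighting are assumptions, not the paper's exact framework.

```python
# Sketch: deepfake detection with an auxiliary emotion head (illustrative multi-task setup).
import torch
import torch.nn as nn

class EmotionAwareDetector(nn.Module):
    def __init__(self, feat_dim=768, num_emotions=6):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 256))
        self.fake_head = nn.Linear(256, 2)           # bona fide vs. spoofed
        self.emotion_head = nn.Linear(256, num_emotions)

    def forward(self, feats):
        h = self.backbone(feats)
        return self.fake_head(h), self.emotion_head(h)

model = EmotionAwareDetector()
feats = torch.randn(8, 768)                          # utterance-level acoustic/semantic features
fake_logits, emo_logits = model(feats)
fake_y = torch.randint(0, 2, (8,))
emo_y = torch.randint(0, 6, (8,))
ce = nn.CrossEntropyLoss()
loss = ce(fake_logits, fake_y) + 0.3 * ce(emo_logits, emo_y)   # lambda = 0.3 is a hypothetical weight
loss.backward()
```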
https://arxiv.org/abs/2512.11241
Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods are either high in visual fidelity but slow, or fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle in one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting onto 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking-head videos, for which we report competitive quantitative and qualitative performance.
https://arxiv.org/abs/2512.10939
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
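The proposed function is simple enough to state directly; a drop-in PyTorch module replacing a normalization layer might look like the sketch below, where the per-channel parameterization and output affine are assumptions about how Derf would be applied in practice.

```python
# Sketch: Derf(x) = erf(alpha * x + s) as a point-wise normalization replacement (parameterization assumed).
import torch
import torch.nn as nn

class Derf(nn.Module):
    def __init__(self, dim, alpha_init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), alpha_init))   # per-channel slope
        self.shift = nn.Parameter(torch.zeros(dim))                 # s
        # Affine output parameters, mirroring LayerNorm's elementwise affine.
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                            # x: (..., dim)
        y = torch.erf(self.alpha * x + self.shift)   # bounded in (-1, 1), like tanh in DyT
        return self.weight * y + self.bias

x = torch.randn(2, 16, 512)
print(Derf(512)(x).shape)                            # torch.Size([2, 16, 512])
```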
https://arxiv.org/abs/2512.10938
Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.
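A minimal sketch of an iterative judge-and-refine loop of the kind described; `call_llm` is a hypothetical helper standing in for whatever LLM client is used, and the prompts, score scale, and threshold are illustrative.

```python
# Sketch: iterative LLM-as-a-Judge scoring and refinement (call_llm is a hypothetical helper, not a real API).
import json

DIMENSIONS = ["relevance", "authenticity", "engagement", "diversity", "personality_consistency"]

def judge(conversation: str, call_llm) -> dict:
    prompt = ("Score this co-viewing conversation from 1-5 on each dimension and return JSON with keys: "
              + ", ".join(DIMENSIONS) + "\n\n" + conversation)
    return json.loads(call_llm(prompt))

def refine(conversation: str, call_llm, threshold: float = 4.0, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        scores = judge(conversation, call_llm)
        weak = [d for d in DIMENSIONS if scores.get(d, 0) < threshold]
        if not weak:
            break                                   # all dimensions meet the bar
        conversation = call_llm(
            f"Rewrite the conversation to improve {', '.join(weak)} while keeping each agent's persona:\n"
            + conversation
        )
    return conversation
```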
https://arxiv.org/abs/2512.10918