Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence referred to as Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models incur relatively high inference latency. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this, and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training for such models. We evaluate our models on a variety of public benchmarks, where they reduce the number of decoder calls by ~3.2x while maintaining or improving WER performance.
https://arxiv.org/abs/2409.08148
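The threshold-based multi-token decoding described above can be illustrated with a simple acceptance loop. This is a minimal sketch, not the paper's implementation: `model`, the number of prediction heads, the threshold value, and the token IDs are hypothetical placeholders, assuming a decoder that emits K next-token distributions in a single call.

```python
import torch

def multi_token_greedy_decode(model, prefix, k_heads=4, threshold=0.7,
                              eos_id=2, max_len=256):
    """Threshold-based multi-token decoding (illustrative sketch).

    `model(ids)` is assumed to return a (k_heads, vocab) tensor holding the
    distributions over the next k tokens, produced in one decoder call.
    """
    tokens = list(prefix)
    decoder_calls = 0
    while len(tokens) < max_len:
        probs = model(torch.tensor(tokens).unsqueeze(0))   # (k_heads, vocab)
        decoder_calls += 1
        for h in range(k_heads):
            p, tok = probs[h].max(dim=-1)
            # The first head is always accepted; later heads only when confident.
            if h > 0 and p.item() < threshold:
                break
            tokens.append(tok.item())
            if tok.item() == eos_id:
                return tokens, decoder_calls
    return tokens, decoder_calls
```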
Integrating named entity recognition (NER) with automatic speech recognition (ASR) can significantly enhance transcription accuracy and informativeness. In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. Building on recent advancements in open NER research, we augment a large synthetic dataset with synthetic speech samples. This allows us to train WhisperNER on a large number of examples with diverse NER tags. During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities. To evaluate WhisperNER, we generate synthetic speech for commonly used NER benchmarks and annotate existing ASR datasets with open NER tags. Our experiments demonstrate that WhisperNER outperforms natural baselines on both out-of-domain open type NER and supervised finetuning.
https://arxiv.org/abs/2409.08107
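As a rough illustration of the prompting scheme described above, the snippet below builds a hypothetical (prompt, target) pair for joint transcription and open-type tagging; the actual WhisperNER prompt wording and tag format are not specified here and may differ.

```python
def build_whisperner_example(entity_types, transcript, entities):
    """Illustrative prompt/target pair for joint ASR + open-type NER.

    Assumes inline <type>...</type> tags; this is a sketch of the idea,
    not the WhisperNER specification.
    """
    prompt = "entities: " + ", ".join(entity_types)
    target = transcript
    # Wrap each annotated span with its (possibly unseen) entity type.
    for span, etype in sorted(entities, key=lambda x: -len(x[0])):
        target = target.replace(span, f"<{etype}>{span}</{etype}>")
    return prompt, target

prompt, target = build_whisperner_example(
    ["person", "city"],
    "alice flew to paris on monday",
    [("alice", "person"), ("paris", "city")],
)
print(prompt)  # entities: person, city
print(target)  # <person>alice</person> flew to <city>paris</city> on monday
```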
We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training the foundation model on the unlabelled set.
https://arxiv.org/abs/2409.08103
Visual Place Recognition (VPR) is fundamental for the global re-localization of robots and devices, enabling them to recognize previously visited locations based on visual inputs. This capability is crucial for maintaining accurate mapping and localization over large areas. Given that VPR methods need to operate in real-time on embedded systems, it is critical to optimize these systems for minimal resource consumption. While the most efficient VPR approaches employ standard convolutional backbones with fixed descriptor dimensions, these often lead to redundancy in the embedding space as well as in the network architecture. Our work introduces a novel structured pruning method, to not only streamline common VPR architectures but also to strategically remove redundancies within the feature embedding space. This dual focus significantly enhances the efficiency of the system, reducing both map and model memory requirements and decreasing feature extraction and retrieval latencies. Our approach has reduced memory usage and latency by 21% and 16%, respectively, across models, while minimally impacting recall@1 accuracy by less than 1%. This significant improvement enhances real-time applications on edge devices with negligible accuracy loss.
https://arxiv.org/abs/2409.07834
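The channel-level structured pruning described above can be sketched with a generic L1-norm filter-pruning pass in PyTorch. This is an illustrative stand-in rather than the paper's actual pruning criterion, and the layer sizes are made up.

```python
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.8) -> nn.Conv2d:
    """Structured pruning sketch: drop whole output channels with the smallest
    L1 filter norms, shrinking both the computation and the descriptor
    dimension built from these feature maps."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per filter
    keep = torch.topk(scores, n_keep).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

layer = nn.Conv2d(256, 512, 3, padding=1)
print(prune_conv_channels(layer, keep_ratio=0.75))  # Conv2d(256, 384, ...)
```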
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, utilizing a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor. This dataset enables us to correct errors across contexts, at both the full-text and segment levels, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, thus making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on the full text and on segments, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Through various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.
https://arxiv.org/abs/2409.07790
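To make the prompt-and-target design concrete, here is a hypothetical example of a full-text correction prompt and a JSON-style target of error-correction pairs. The wording, schema, and field names are assumptions for illustration, not the ChFT specification.

```python
import json

def build_correction_prompt(asr_text, output_format="json"):
    """Hypothetical prompt for full-text ASR error correction; the actual
    ChFT prompts and target schema may differ."""
    if output_format == "json":
        instruction = ("Find the errors in the ASR transcript below and return "
                       "them as JSON pairs of {\"wrong\": ..., \"correct\": ...}.")
    else:
        instruction = "Rewrite the ASR transcript below with all errors corrected."
    return f"{instruction}\n\nTranscript:\n{asr_text}"

# Example target in the JSON-pair output format.
target = json.dumps(
    [{"wrong": "their", "correct": "there"},
     {"wrong": "whether report", "correct": "weather report"}],
    ensure_ascii=False)
print(build_correction_prompt("their is rain in the whether report"))
print(target)
```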
In real-world clinical settings, data distributions evolve over time, with a continuous influx of new, limited disease cases. Therefore, class-incremental learning is of great significance, i.e., deep learning models are required to learn new class knowledge while maintaining accurate recognition of previous diseases. However, traditional deep neural networks often suffer from severe forgetting of prior knowledge when adapting to new data unless trained from scratch, which undesirably incurs substantial time and computational cost. Additionally, the sample sizes for different diseases can be highly imbalanced, with newly emerging diseases typically having far fewer instances, consequently causing classification bias. To tackle these challenges, we are the first to propose a class-incremental learning method under limited samples in the biomedical field. First, we propose a novel cumulative entropy prediction module to measure the uncertainty of the samples, of which the most uncertain samples are stored in a memory bank as exemplars for the model's later review. Furthermore, we theoretically demonstrate its effectiveness in measuring uncertainty. Second, we develop a fine-grained semantic expansion module through various augmentations, leading to more compact distributions within the feature space and creating sufficient room for generalization to new classes. Besides, a cosine classifier is utilized to mitigate the classification bias caused by imbalanced datasets. Across four imbalanced data distributions over two datasets, our method achieves optimal performance, surpassing state-of-the-art methods by as much as 53.54% in accuracy.
https://arxiv.org/abs/2409.07757
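A minimal sketch of two of the ingredients named above: entropy-based exemplar selection for the memory bank and a cosine classifier. The actual cumulative entropy module aggregates uncertainty differently, so treat this as a simplified stand-in with invented sizes.

```python
import torch
import torch.nn.functional as F

def select_exemplars(logits: torch.Tensor, n_keep: int) -> torch.Tensor:
    """Keep the most uncertain samples (highest predictive entropy) as
    exemplars for later rehearsal; a simplified stand-in for the paper's
    cumulative entropy prediction module."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return torch.topk(entropy, n_keep).indices

class CosineClassifier(torch.nn.Module):
    """Classifies by cosine similarity to per-class weight vectors, which
    reduces the bias toward majority classes induced by imbalanced norms."""
    def __init__(self, feat_dim: int, n_classes: int, scale: float = 16.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_classes, feat_dim))
        self.scale = scale

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.scale * F.normalize(feats, dim=-1) @ F.normalize(self.weight, dim=-1).T

logits = torch.randn(100, 10)
print(select_exemplars(logits, n_keep=20).shape)           # torch.Size([20])
print(CosineClassifier(64, 10)(torch.randn(8, 64)).shape)  # torch.Size([8, 10])
```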
Abnormal behavior detection, action recognition, and fight and violence detection in videos are areas that have attracted a lot of interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. A CNN is used to extract spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics from the CNN features of multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities and achieves accuracies of up to 98%. The obtained results are promising and demonstrate the effectiveness of the proposed end-to-end approach.
https://arxiv.org/abs/2409.07588
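The per-frame-CNN-plus-BiGRU pipeline can be sketched as below. The tiny CNN, feature sizes, and two-class head are placeholders for illustration rather than the paper's actual backbone.

```python
import torch
import torch.nn as nn

class ViolenceDetector(nn.Module):
    """Minimal CNN + BiGRU sketch: a small 2D CNN encodes each frame, a
    bidirectional GRU aggregates the per-frame features over time, and a
    linear head outputs the violence / non-violence logits."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, clips):                    # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, h = self.bigru(feats)                 # h: (2, B, hidden)
        return self.head(torch.cat([h[0], h[1]], dim=-1))

print(ViolenceDetector()(torch.randn(2, 16, 3, 112, 112)).shape)  # torch.Size([2, 2])
```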
Foundational models, trained on vast and diverse datasets, have demonstrated remarkable capabilities in generalizing across different domains and distributions for various zero-shot tasks. Our work addresses the challenge of retaining these powerful generalization capabilities when adapting foundational models to specific downstream tasks through fine-tuning. To this end, we introduce a novel approach we call "similarity loss", which can be incorporated into the fine-tuning process of any task. By minimizing the distortion of fine-tuned embeddings from the pre-trained embeddings, our method strikes a balance between task-specific adaptation and preserving broad generalization abilities. We evaluate our approach on two diverse tasks: image classification on satellite imagery and face recognition, focusing on open-class and domain shift scenarios to assess out-of-distribution (OOD) performance. We demonstrate that this approach significantly improves OOD performance while maintaining strong in-distribution (ID) performance.
https://arxiv.org/abs/2409.07582
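A minimal sketch of the "similarity loss" idea, assuming a cosine-distance penalty between the fine-tuned embeddings and the frozen pre-trained embeddings; the paper's exact distance and weighting may differ.

```python
import torch
import torch.nn.functional as F

def similarity_loss(finetuned_emb, pretrained_emb):
    """Penalize the drift of fine-tuned embeddings away from the frozen
    pre-trained ones (here via mean cosine distance)."""
    return (1 - F.cosine_similarity(finetuned_emb, pretrained_emb, dim=-1)).mean()

def total_loss(task_loss, finetuned_emb, pretrained_emb, lam=0.5):
    # Balance task-specific adaptation against preserving broad generalization.
    return task_loss + lam * similarity_loss(finetuned_emb, pretrained_emb)

emb_ft = torch.randn(8, 512, requires_grad=True)
emb_pt = torch.randn(8, 512)
print(total_loss(torch.tensor(1.2), emb_ft, emb_pt))
```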
The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, can significantly improve the visual quality of the generated images. Exploring diffusion models to enhance the generated images is nevertheless not trivial and necessitates delicately enriching details while preserving the visual appearance of key content in the original image. In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage process that first adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to denoise and enhance the image details. In the noising stage, FreeEnhance is devised to add lighter noise to regions with higher frequency to preserve the high-frequency patterns (e.g., edges, corners) in the original image. In the denoising stage, we present three target properties as constraints to regularize the predicted noise, enhancing images with high acutance and high visual quality. Extensive experiments conducted on the HPDv2 dataset demonstrate that our FreeEnhance outperforms state-of-the-art image enhancement models in terms of quantitative metrics and human preference. More remarkably, FreeEnhance also shows higher human preference compared to the commercial image enhancement solution Magnific AI.
https://arxiv.org/abs/2409.07451
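The noising stage (lighter noise where the image has more high-frequency content) might look roughly like this. The Laplacian-based frequency estimate and the noise schedule are illustrative assumptions, not FreeEnhance's actual formulation.

```python
import torch
import torch.nn.functional as F

def frequency_aware_noising(img, base_sigma=0.3):
    """Sketch of the noising stage: estimate local high-frequency content with
    a Laplacian filter and add *less* noise where it is high, so edges and
    corners of the original image survive into the denoising stage."""
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    gray = img.mean(dim=1, keepdim=True)                        # (B, 1, H, W)
    high_freq = F.conv2d(gray, lap, padding=1).abs()
    high_freq = high_freq / (high_freq.amax(dim=(2, 3), keepdim=True) + 1e-8)
    sigma = base_sigma * (1.0 - high_freq)                      # lighter noise on edges
    return img + sigma * torch.randn_like(img)

noisy = frequency_aware_noising(torch.rand(1, 3, 64, 64))
print(noisy.shape)  # torch.Size([1, 3, 64, 64])
```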
To promote inclusion and ensure effective communication for those who rely on sign language as their main form of communication, sign language recognition (SLR) is crucial. SLR integrates seamlessly with diverse technologies, enhancing accessibility for the deaf community by facilitating their use of digital platforms, video calls, and communication devices. To effectively solve this problem, we suggest a novel solution that uses a deep neural network to fully automate sign language recognition. This methodology integrates sophisticated preprocessing methodologies to optimise the overall performance. The ResNet, Inception, Xception, and VGG architectures are utilised to selectively categorise images of sign language. We prepared a DNN architecture and merged it with the preprocessing architectures. In the post-processing phase, we utilised the SHAP deep explainer, which is based on cooperative game theory, to quantify the influence of specific features on the output of the machine learning model. The Bhutanese-Sign-Language (BSL) dataset was used for training and testing the suggested technique. When trained on the Bhutanese-Sign-Language (BSL) dataset, ResNet50 combined with the DNN model achieved the best accuracy, 98.90%. Our model's ability to provide informational clarity was assessed using the SHAP (SHapley Additive exPlanations) method. Owing in part to its considerable robustness and reliability, the proposed methodological approach can be used to develop a fully automated system for sign language recognition.
https://arxiv.org/abs/2409.07426
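For the post-processing step, a typical use of SHAP's deep explainer on a PyTorch classifier looks roughly like the following. The model here is a toy stand-in for the paper's ResNet50-plus-DNN setup, and the exact shape of the returned attributions depends on the SHAP version installed.

```python
import shap
import torch
import torch.nn as nn

# Toy stand-in classifier; the paper pairs pretrained CNN backbones with a DNN head.
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 64 * 64, 128), nn.ReLU(),
                      nn.Linear(128, 10))

background = torch.rand(16, 3, 64, 64)    # reference samples for the explainer
test_images = torch.rand(4, 3, 64, 64)

# DeepExplainer attributes each input pixel's contribution to the class scores.
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(test_images)
print(len(shap_values))   # per-class attribution maps (layout varies by SHAP version)
```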
Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in the text-dominated multimodal affective computing field. This survey presents recent trends in multimodal affective computing from an NLP perspective through four hot tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition. The goal of this survey is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across the various tasks, offering a comprehensive report on recent progress in multimodal affective computing from an NLP perspective. This survey covers the formalization of tasks, provides an overview of relevant works, describes benchmark datasets, and details the evaluation metrics for each task. Additionally, it briefly discusses research in multimodal affective computing involving facial expressions, acoustic signals, physiological signals, and emotion causes. We also discuss the technical approaches, challenges, and future directions in multimodal affective computing. To support further research, we released a repository that compiles related works in multimodal affective computing, providing detailed resources and references for the community.
https://arxiv.org/abs/2409.07388
Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge is adaptive to multiple tasks and universal environments. Despite increasing efforts in the field of Reinforcement Learning (RL) with the aid of transformers, most approaches are limited by an offline training pipeline, which restricts exploration and generalization abilities. To address this limitation, we propose the framework of Online Decision MetaMorphFormer (ODM), which aims to achieve self-awareness, environment recognition, and action planning through a unified model architecture. Motivated by cognitive and behavioral psychology, an ODM agent is able to learn from others, recognize the world, and practice itself based on its own experience. ODM can also be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets. Through the use of pre-trained datasets, ODM can quickly warm up and learn the necessary knowledge to perform the desired task, while the target environment continues to reinforce the universal policy. Extensive online experiments as well as few-shot and zero-shot environmental tests are used to verify ODM's performance and generalization ability. The results of our study contribute to the study of general artificial intelligence in embodied and cognitive fields. Code, results, and video examples can be found on the website \url{this https URL}.
https://arxiv.org/abs/2409.07341
Hand pose estimation from egocentric video has broad implications across various domains, including human-computer interaction, assistive technologies, activity recognition, and robotics, making it a topic of significant research interest. The efficacy of modern machine learning models depends on the quality of data used for their training. Thus, this work is devoted to the analysis of state-of-the-art egocentric datasets suitable for 2D hand pose estimation. We propose a novel protocol for dataset evaluation, which encompasses not only the analysis of stated dataset characteristics and assessment of data quality, but also the identification of dataset shortcomings through the evaluation of state-of-the-art hand pose estimation models. Our study reveals that despite the availability of numerous egocentric databases intended for 2D hand pose estimation, the majority are tailored for specific use cases. There is no ideal benchmark dataset yet; however, H2O and GANerated Hands datasets emerge as the most promising real and synthetic datasets, respectively.
https://arxiv.org/abs/2409.07337
In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for the utterances generated by the same vocoder and natural spectrogram, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.
https://arxiv.org/abs/2409.07259
In the current landscape of biometrics and surveillance, the ability to accurately recognize faces in uncontrolled settings is paramount. The Watchlist Challenge addresses this critical need by focusing on face detection and open-set identification in real-world surveillance scenarios. This paper presents a comprehensive evaluation of participating algorithms, using the enhanced UnConstrained College Students (UCCS) dataset with new evaluation protocols. In total, four participants submitted four face detection and nine open-set face recognition systems. The evaluation demonstrates that while detection capabilities are generally robust, closed-set identification performance varies significantly, with models pre-trained on large-scale datasets showing superior performance. However, open-set scenarios require further improvement, especially at higher true positive identification rates, i.e., lower thresholds.
https://arxiv.org/abs/2409.07220
This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.
https://arxiv.org/abs/2409.07210
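The feature-normalization improvement can be illustrated with a simple normalized feature-regression distillation loss. The normalization and the loss below are assumptions for illustration, not LiteVSR2's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats):
    """Feature-level distillation with normalization (sketch): both the
    student (visual) and teacher (ASR) features are standardized before the
    regression loss, so the student matches the shape of the teacher's
    representation rather than its absolute scale."""
    def norm(x):
        return (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + 1e-6)
    return F.mse_loss(norm(student_feats), norm(teacher_feats))

print(distillation_loss(torch.randn(4, 50, 256), torch.randn(4, 50, 256)))
```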
Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time-complexity alternative to self-attention for non-streaming speech recognition that, for the first time, matches or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time-complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
https://arxiv.org/abs/2409.07165
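A simplified sketch of the SummaryMixing idea: every frame is combined with one global summary vector instead of attending to all other frames, giving linear cost in the sequence length. The real block, its streaming/chunked variant, and its integration into the Conformer Transducer involve details omitted here, and the layer sizes are invented.

```python
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    """Simplified linear-time replacement for self-attention: each frame is
    mixed with a single summary vector (the mean of a per-frame projection)
    rather than attending to every other frame."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.local = nn.Linear(dim, hidden)      # per-frame transformation
        self.summary = nn.Linear(dim, hidden)    # contributes to the global summary
        self.combine = nn.Linear(2 * hidden, dim)

    def forward(self, x):                        # x: (B, T, dim), cost O(T)
        s = self.summary(x).mean(dim=1, keepdim=True)              # (B, 1, hidden)
        local = self.local(x)                                      # (B, T, hidden)
        return self.combine(torch.cat([local, s.expand_as(local)], dim=-1))

print(SummaryMixing(dim=256, hidden=512)(torch.randn(2, 100, 256)).shape)
```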
Biometric authentication has garnered significant attention as a secure and efficient method of identity verification. Among the various modalities, hand vein biometrics, including finger vein, palm vein, and dorsal hand vein recognition, offer unique advantages due to their high accuracy, low susceptibility to forgery, and non-intrusiveness. The vein patterns within the hand are highly complex and distinct for each individual, making them an ideal biometric identifier. Additionally, hand vein recognition is contactless, enhancing user convenience and hygiene compared to other modalities such as fingerprint or iris recognition. Furthermore, the veins are internally located, rendering them less susceptible to damage or alteration, thus enhancing the security and reliability of the biometric system. The combination of these factors makes hand vein biometrics a highly effective and secure method for identity verification. This review paper delves into the latest advancements in deep learning techniques applied to finger vein, palm vein, and dorsal hand vein recognition. It encompasses all essential fundamentals of hand vein biometrics, summarizes publicly available datasets, and discusses state-of-the-art metrics used for evaluating the three modes. Moreover, it provides a comprehensive overview of suggested approaches for finger, palm, dorsal, and multimodal vein techniques, offering insights into the best performance achieved, data augmentation techniques, and effective transfer learning methods, along with associated pretrained deep learning models. Additionally, the review addresses research challenges faced and outlines future directions and perspectives, encouraging researchers to enhance existing methods and propose innovative techniques.
https://arxiv.org/abs/2409.07128
A new algorithm for incremental learning in the context of Tiny Machine Learning (TinyML) is presented, optimized for low-performance and energy-efficient embedded devices. TinyML is an emerging field that deploys machine learning models on resource-constrained devices such as microcontrollers, enabling intelligent applications like voice recognition, anomaly detection, predictive maintenance, and sensor data processing in environments where traditional machine learning models are not feasible. The algorithm addresses the challenge of catastrophic forgetting through the use of knowledge distillation to create a small, distilled dataset. The novelty of the method is that the size of the model can be adjusted dynamically, so that the complexity of the model can be adapted to the requirements of the task. This offers a solution for incremental learning in resource-constrained environments, where both model size and computational efficiency are critical factors. Results show that the proposed algorithm offers a promising approach for TinyML incremental learning on embedded devices. The algorithm was tested on five datasets: CIFAR10, MNIST, CORE50, HAR, and Speech Commands. The findings indicate that, despite using only 43% of the Floating Point Operations (FLOPs) of a larger fixed model, the algorithm experienced a negligible accuracy loss of just 1%. In addition, the presented method is memory-efficient: while state-of-the-art incremental learning is usually very memory-intensive, the method requires only 1% of the original dataset.
https://arxiv.org/abs/2409.07114
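The distilled-replay idea can be sketched as follows: keep roughly 1% of the old data together with the old model's soft predictions, and distill from that tiny set while learning new classes. The helper names, the selection strategy, and the loss form are assumptions for illustration, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def build_distilled_replay(old_model, old_inputs, keep_fraction=0.01):
    """Store a tiny replay set (about 1% of the old data) with the old
    model's soft predictions, so later training can distill old-class
    knowledge without keeping the full dataset."""
    n_keep = max(1, int(len(old_inputs) * keep_fraction))
    subset = old_inputs[torch.randperm(len(old_inputs))[:n_keep]]
    with torch.no_grad():
        soft_targets = F.softmax(old_model(subset), dim=-1)
    return subset, soft_targets

def replay_loss(new_model, replay_inputs, soft_targets):
    # Cross-entropy between the new model's predictions and the stored soft labels.
    log_p = F.log_softmax(new_model(replay_inputs), dim=-1)
    return -(soft_targets * log_p).sum(dim=-1).mean()

old_model = torch.nn.Linear(20, 5)
subset, soft = build_distilled_replay(old_model, torch.randn(1000, 20))
print(subset.shape, replay_loss(torch.nn.Linear(20, 5), subset, soft))
```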
In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for Multimodal Emotion Recognition. Firstly, we introduce EmoVCLIP, a model fine-tuned based on CLIP using vision-language prompt learning, designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos. In this process, we use unlabeled videos with high-confidence pseudo-labels generated by our model and incorporate them into the training set. Experimental results demonstrate that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.
https://arxiv.org/abs/2409.07078
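Modality dropout, as used above for robust fusion, can be sketched as randomly zeroing whole modality feature vectors during training. The dropout probability and the zero-fill strategy here are illustrative choices rather than the authors' exact settings.

```python
import torch

def modality_dropout(modalities, p_drop=0.3, training=True):
    """Randomly zero out whole modalities (e.g., video / audio / text features)
    before fusion so the model does not over-rely on any single one.
    At least one modality is always kept."""
    if not training:
        return modalities
    keep = [torch.rand(1).item() >= p_drop for _ in modalities]
    if not any(keep):
        keep[torch.randint(len(modalities), (1,)).item()] = True
    return [m if k else torch.zeros_like(m) for m, k in zip(modalities, keep)]

video, audio, text = torch.randn(8, 512), torch.randn(8, 256), torch.randn(8, 768)
fused_inputs = modality_dropout([video, audio, text])
print([f.abs().sum().item() > 0 for f in fused_inputs])  # which modalities survived
```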