Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. Nevertheless, the performance of these end-to-end models, especially attention-based models, degrades markedly on long utterances. To address this limitation, we propose adding a fully-differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can improve generalization to longer utterances, since it allows the system to store and retrieve more information recurrently. Specifically, we explore the neural Turing machine (NTM), which yields our proposed Conformer-NTM model architecture for ASR. Experimental results using the Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the baseline conformer without memory for long utterances.
https://arxiv.org/abs/2309.13029
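As a rough illustration of the external memory described above, here is a minimal numpy sketch of NTM-style content-based addressing with differentiable read and write operations. The slot count, key source, and sharpening parameter are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_address(memory, key, beta=5.0):
    """Cosine-similarity addressing over memory rows, sharpened by beta."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sims)

def ntm_read(memory, weights):
    """Read vector: convex combination of memory rows."""
    return weights @ memory

def ntm_write(memory, weights, erase, add):
    """Erase-then-add update, differentiable in all arguments."""
    memory = memory * (1 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

# toy usage: 128 slots of width 64, standing in for the memory placed
# between the conformer encoder and decoder (sizes are assumptions)
M = np.random.randn(128, 64) * 0.1
k = np.random.randn(64)          # key assumed to come from a controller
w = content_address(M, k)
r = ntm_read(M, w)               # read vector passed on to the decoder
M = ntm_write(M, w, erase=np.ones(64) * 0.5, add=k)
```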
Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training that must be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, resulting in either sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
https://arxiv.org/abs/2309.13018
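The abstract builds on adaptive masking, where the pruning mask is re-derived from the current weight magnitudes rather than fixed after one pruning round. The PyTorch sketch below shows that generic idea; it is not the paper's Dynamic ASR Pathways algorithm, and the sparsity level and layer shape are assumptions.

```python
import torch

def adaptive_mask(weights: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Recompute a binary mask from current weight magnitudes,
    so the sub-network can change as training proceeds."""
    k = int(weights.numel() * sparsity)            # number of weights to zero
    threshold = weights.abs().flatten().kthvalue(k).values
    return (weights.abs() > threshold).float()

# toy usage: re-derive the mask every step instead of fixing it up front
w = torch.randn(256, 256, requires_grad=True)
mask = adaptive_mask(w.detach(), sparsity=0.7)
effective_w = w * mask                             # forward pass uses masked weights
```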
The detection and recognition of text in camera-captured images and videos remains a highly challenging research problem. Despite advances that achieve high accuracy, current methods still require substantial improvement to be applicable in practical scenarios. Diverging from text detection in generic images and videos, this paper addresses text detection within license plates by amalgamating multiple frames captured from distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints (view-1, view-2, and view-3) to identify the nearest neighboring components, facilitating the restoration of text components from the same license plate line based on estimated similarity levels and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and the publicly available Stanford Cars Dataset demonstrate the superiority of the proposed method over existing approaches.
https://arxiv.org/abs/2309.12972
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T, which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC model and is more tolerant of severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
https://arxiv.org/abs/2309.12963
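Funnel pooling reduces the encoder frame rate so that attention and the decoder operate over fewer frames. Below is a minimal PyTorch sketch of the generic time-reduction mechanism; the stride, placement, and pooling type here are assumptions rather than the USM-based configuration used in the paper.

```python
import torch
import torch.nn as nn

class FunnelPool(nn.Module):
    """Average-pool encoder frames along time to cut the frame rate."""
    def __init__(self, stride: int = 2):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features) -> pool over the time axis
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

# toy usage: two pooling layers give a 4x frame-rate reduction
x = torch.randn(8, 1200, 512)        # 1200 encoder frames
x = FunnelPool(2)(FunnelPool(2)(x))  # -> (8, 300, 512)
```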
Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
https://arxiv.org/abs/2309.12881
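In-context learning here means the LLM sees labeled example utterances inside the prompt itself. The sketch below assembles such a prompt for emotion labelling; the label set and wording are hypothetical illustrations, not the prompts used in the study.

```python
def build_affect_prompt(utterance: str, examples: list[tuple[str, str]],
                        labels=("neutral", "happy", "sad", "angry")) -> str:
    """Assemble a few-shot in-context-learning prompt for emotion labelling.
    Zero-shot is the same prompt with an empty example list."""
    lines = [f"Classify the speaker's emotion as one of: {', '.join(labels)}."]
    for text, label in examples:
        lines.append(f'Utterance: "{text}"\nEmotion: {label}')
    lines.append(f'Utterance: "{utterance}"\nEmotion:')
    return "\n\n".join(lines)

print(build_affect_prompt(
    "I can't believe this happened again.",
    examples=[("That's wonderful news!", "happy")],
))
```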
Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOS-NET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains' closed and pseudo-open spaces. Furthermore, we propose a domain-specific, batch-normalized class-prototype alignment strategy to align both domains globally while ensuring class-discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation.
https://arxiv.org/abs/2309.12814
Open-set object recognition aims to identify whether an object is from a class that has been encountered during training. To perform open-set object recognition accurately, a key challenge is how to reduce the reliance on spurious-discriminative features. In this paper, motivated by the observation that different large models pre-trained through different paradigms can possess rich yet distinct implicit knowledge, we propose a novel framework named Large Model Collaboration (LMC) to tackle the above challenge by collaborating different off-the-shelf large models in a training-free manner. Moreover, we equip the proposed framework with several novel designs to effectively extract implicit knowledge from large models. Extensive experiments demonstrate the efficacy of our proposed framework. Code is available at this https URL.
https://arxiv.org/abs/2309.12780
Self-supervised representation learning (SSRL) has improved downstream phoneme recognition performance relative to supervised models. Training SSRL models requires a large amount of pre-training data, which poses a challenge for low-resource languages. A common approach is transferring knowledge from other languages. Instead, we propose using audio augmentation to pre-train SSRL models in a low-resource condition and evaluate phoneme recognition as the downstream task. We performed a systematic comparison of augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech. We found that the combined augmentation (noise/pitch) was the best strategy, outperforming accent and language knowledge transfer. We compared performance with various quantities and types of pre-training data, and examined the scaling factor of augmented data needed to achieve performance equivalent to models pre-trained with target-domain speech. Our findings suggest that for resource-constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other-language speech.
https://arxiv.org/abs/2309.12763
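For concreteness, here is a minimal numpy sketch of the two augmentations that performed best above (noise addition and pitch variation). The SNR value and pitch factor are illustrative, and the pitch shift is a crude resampling stand-in for a proper pitch shifter.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def pitch_shift_crude(speech: np.ndarray, factor: float) -> np.ndarray:
    """Very crude pitch variation by linear-interpolation resampling
    (changes duration too; real pipelines would use a proper shifter)."""
    idx = np.arange(0, len(speech), factor)
    return np.interp(idx, np.arange(len(speech)), speech)

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of fake 16 kHz audio
augmented = add_noise(pitch_shift_crude(clean, 1.05),
                      rng.standard_normal(16000), snr_db=10)
```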
Video analysis is a major computer vision task that has received a lot of attention in recent years. The current state-of-the-art performance for video analysis is achieved with Deep Neural Networks (DNNs) that have high computational costs and need large amounts of labeled data for training. Spiking Neural Networks (SNNs) have significantly lower computational costs (thousands of times lower) than regular non-spiking networks when implemented on neuromorphic hardware. They have been used for video analysis with methods like 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have a significantly larger number of parameters than spiking 2D CSNNs. This not only increases the computational costs but also makes these networks more difficult to implement on neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) to reduce the number of parameters required for video analysis. This unsupervised learning has the advantage of not needing large amounts of labeled data for training. Factorizing a single spatio-temporal spiking convolution into a spatial and a temporal spiking convolution decreases the number of parameters of the network. We test our network on the KTH, Weizmann, and IXMAS datasets, and we show that S3TCs successfully extract spatio-temporal information from videos, while increasing the output spiking activity and outperforming spiking 3D convolutions.
https://arxiv.org/abs/2309.12761
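The parameter saving comes from factorizing each 3D convolution into a spatial convolution followed by a temporal one. The PyTorch sketch below demonstrates the factorization for ordinary (non-spiking) convolutions only; the paper's S3TCs apply the same idea to spiking convolutions trained with STDP.

```python
import torch
import torch.nn as nn

def s3tc_block(cin: int, cout: int, k_t: int = 3, k_s: int = 3) -> nn.Sequential:
    """Factorize a (k_t, k_s, k_s) 3D convolution into a spatial conv
    followed by a temporal conv, cutting the parameter count."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=(1, k_s, k_s), padding=(0, k_s // 2, k_s // 2)),
        nn.Conv3d(cout, cout, kernel_size=(k_t, 1, 1), padding=(k_t // 2, 0, 0)),
    )

full = nn.Conv3d(64, 64, kernel_size=3, padding=1)
sep = s3tc_block(64, 64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(sep))   # 110656 vs 49280 parameters
```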
Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need for hand-crafted audio features, we employ a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data. The output feature maps of the preprocessing step are then fed into a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Using the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, namely a support vector machine classifier and transfer learning from a pretrained CNN. Comparing the proposed method to state-of-the-art methods on the SER task further indicates its superiority. Our findings underscore the pivotal role of deep unsupervised feature learning in advancing SER, offering enhanced emotional comprehension in the realm of human-computer interaction.
https://arxiv.org/abs/2309.12714
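A minimal sketch of the pipeline shape, extracting features with the Wav2Vec2 model bundled in torchaudio and classifying them with a small Conv1d head, follows. The head architecture and class count are assumptions; the paper's custom CNN is not specified in the abstract.

```python
import torch
import torch.nn as nn
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE      # self-supervised extractor
extractor = bundle.get_model().eval()

class EmotionCNN(nn.Module):
    """Small Conv1d head over Wav2Vec2 features (layout is illustrative)."""
    def __init__(self, n_classes: int = 6, dim: int = 768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) -> (batch, n_classes)
        return self.fc(self.conv(feats.transpose(1, 2)).squeeze(-1))

wave = torch.randn(1, bundle.sample_rate)        # 1 s of fake audio
with torch.no_grad():
    feats, _ = extractor.extract_features(wave)  # list of layer outputs
logits = EmotionCNN()(feats[-1])
```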
Recent progress in Automatic Speech Recognition (ASR) has been coupled with a substantial increase in model sizes, which may now reach billions of parameters, leading to slow inference even on adapted hardware. In this context, ASR models exist in various sizes, with different inference costs leading to different performance levels. Based on the observation that smaller models perform optimally on large parts of testing corpora, we propose to train a decision module that, given an audio sample, selects the smallest model sufficient for a good transcription. We apply our approach to two Whisper models of different sizes. By keeping the decision process computationally efficient, we build a decision module that enables substantial computational savings with limited performance degradation.
https://arxiv.org/abs/2309.12712
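One way to picture the decision module: a cheap router that inspects the audio and picks a model before any transcription happens. The sketch below uses hand-rolled features and a linear rule as placeholders; the paper's actual module, its input features, and its training target are not described in the abstract.

```python
import numpy as np

def cheap_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Inexpensive features for the router: duration, energy, zero-crossings."""
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2
    return np.array([len(audio) / sr, np.mean(audio ** 2), zcr])

def route(audio: np.ndarray, w: np.ndarray, b: float) -> str:
    """Linear decision rule: pick the small model unless the score
    predicts it would transcribe poorly (hypothetical rule)."""
    score = cheap_features(audio) @ w + b
    return "whisper-small" if score < 0.0 else "whisper-large"

# w and b would be learned from (features, small-model quality) pairs;
# the values here are placeholders
rng = np.random.default_rng(0)
print(route(rng.standard_normal(32000), w=np.array([0.1, 1.0, 0.5]), b=-0.4))
```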
This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization results obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness knowledge from the target domain and the results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65% and 62% relative improvements on the development and evaluation sets over the organizer-provided VC-based baseline diarization system, securing third place in diarization performance.
https://arxiv.org/abs/2309.12656
Surface electromyography (sEMG) and high-density sEMG (HD-sEMG) biosignals have been extensively investigated for myoelectric control of prosthetic devices, neurorobotics, and more recently human-computer interfaces because of their capability for hand gesture recognition/prediction in a wearable and non-invasive manner. High intraday (same-day) performance has been reported. However, interday performance (with separate training and testing days) is substantially degraded due to the poor generalizability of conventional approaches over time, hindering the application of such techniques in real-life practice. Recent studies on the feasibility of multi-day hand gesture recognition are limited, and the existing studies face a major challenge: the need for long sEMG epochs makes the corresponding neural interfaces impractical due to the delay induced in myoelectric control. This paper proposes a compact ViT-based network for multi-day dynamic hand gesture prediction. We tackle this main challenge in that the proposed model relies only on very short HD-sEMG signal windows (i.e., 50 ms, accounting for only one-sixth of the convention for real-time myoelectric implementation), boosting agility and responsiveness. Our proposed model can predict 11 dynamic gestures for 20 subjects with an average accuracy of over 71% on the testing day, 3-25 days after training. Moreover, when calibrated on just a small portion of data from the testing day, the proposed model can achieve over 92% accuracy while retraining less than 10% of the parameters, preserving computational efficiency.
https://arxiv.org/abs/2309.12602
In surveillance, accurate license plate recognition is hindered by the often low quality and small dimensions of plate images, compromising recognition precision. Despite advancements in AI-based image super-resolution, methods like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model on a curated dataset of Saudi license plates, in both low and high resolutions, we discovered the diffusion model's superior efficacy. The method achieves 12.55% and 37.32% improvements in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, our method surpasses these techniques in terms of the Structural Similarity Index (SSIM), registering 4.89% and 17.66% improvements over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.
https://arxiv.org/abs/2309.12506
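For reference, the PSNR metric quoted above is defined from the mean squared error between the restored image and the ground truth; a short numpy implementation:

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB: higher means closer to the reference."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

# toy usage on fake 8-bit license-plate crops
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 128), dtype=np.uint8)
out = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(round(psnr(ref, out), 2))
```

SSIM, the other metric reported, is computed instead from local luminance, contrast, and structure statistics.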
In recent times, there has been increased interest in the identification and re-identification of people at long distances, such as from rooftop cameras, UAV cameras, street cams, and others. Such recognition needs to go beyond the face and use whole-body markers such as gait. However, datasets to train and test such recognition algorithms are not widely available, and fewer still are labeled. This paper introduces DIOR, a framework for data collection and semi-automated annotation, and also provides a dataset with 14 subjects and 1.649 million RGB frames with 3D/2D skeleton gait labels, including 200 thousand frames from a long-range camera. Our approach leverages advanced 3D computer vision techniques to attain pixel-level accuracy in indoor settings with motion capture systems. Additionally, for outdoor long-range settings, we remove the dependency on motion capture systems and adopt a low-cost, hybrid 3D computer vision and learning pipeline with only 4 low-cost RGB cameras, successfully achieving precise skeleton labeling on far-away subjects, even when their height is limited to a mere 20-25 pixels within an RGB frame. Upon publication, we will make our pipeline open for others to use.
https://arxiv.org/abs/2309.12429
The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at distinct temporal resolutions to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct feature fusion methods, carefully designed for the characteristics of CSLR: (1) Bi-directional Feature Fusion (BFF), which facilitates the transfer of dynamic semantics into spatial semantics and vice versa; and (2) Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks, while avoiding the need for extra inference time. As a result, our model further strengthens spatial and dynamic representations in parallel. We demonstrate that the proposed framework outperforms the current state-of-the-art performance on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.
https://arxiv.org/abs/2309.12304
Large language models (LLMs) have demonstrated dominant performance in many NLP tasks, especially on generative tasks. However, they often fall short in some information extraction tasks, particularly those requiring domain-specific knowledge, such as Biomedical Named Entity Recognition (NER). In this paper, inspired by Chain-of-Thought prompting, we leverage the LLM to solve Biomedical NER step by step, breaking the NER task down into entity span extraction and entity type determination. Additionally, for entity type determination, we inject entity knowledge to address the LLM's lack of domain knowledge when predicting entity categories. Experimental results show a significant improvement of our two-step BioNER approach over the previous few-shot LLM baseline. Additionally, incorporating external knowledge significantly enhances entity category determination performance.
https://arxiv.org/abs/2309.12278
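The two-step decomposition can be pictured as two separate prompts: one for span extraction and one for type determination with injected knowledge. The wording, label set, and knowledge string below are hypothetical illustrations, not the paper's prompts.

```python
def span_prompt(sentence: str) -> str:
    """Step 1: ask the LLM only for entity spans, not types."""
    return (f"List every biomedical entity mention in the sentence, "
            f"one per line, without classifying it.\nSentence: {sentence}\nMentions:")

def type_prompt(sentence: str, mention: str, knowledge: str) -> str:
    """Step 2: classify one extracted span, with injected entity knowledge."""
    return (f"Background knowledge: {knowledge}\n"
            f"Sentence: {sentence}\n"
            f'What is the entity type of "{mention}"? '
            f"Answer with one of: gene, disease, chemical.\nType:")

sent = "BRCA1 mutations increase the risk of breast cancer."
print(span_prompt(sent))
print(type_prompt(sent, "BRCA1",
                  knowledge="BRCA1 is a human tumor-suppressor gene."))
```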
In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing the transcript and the translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performance on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at this https URL.
https://arxiv.org/abs/2309.12234
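At its core, the dual-CTC objective supervises one CTC head with the source transcript and another with the target translation. A minimal PyTorch sketch of that joint loss follows; the interpolation weight and two-head layout are assumptions, and BiL-CTC+ adds refinements beyond this plain combination.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def dual_ctc_loss(log_probs_src, log_probs_tgt,
                  transcript, translation,
                  in_lens, transcript_lens, translation_lens,
                  alpha: float = 0.5):
    """Joint objective: CTC against the source-language transcript plus
    CTC against the target-language translation."""
    l_src = ctc(log_probs_src, transcript, in_lens, transcript_lens)
    l_tgt = ctc(log_probs_tgt, translation, in_lens, translation_lens)
    return alpha * l_src + (1 - alpha) * l_tgt

# toy shapes: 50 frames, batch 2, 30-symbol vocabulary per side
T, N, C = 50, 2, 30
lp_src = torch.randn(T, N, C).log_softmax(-1)
lp_tgt = torch.randn(T, N, C).log_softmax(-1)
loss = dual_ctc_loss(lp_src, lp_tgt,
                     torch.randint(1, C, (N, 12)), torch.randint(1, C, (N, 14)),
                     torch.full((N,), T), torch.full((N,), 12), torch.full((N,), 14))
```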
Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating at a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network and trained discriminatively. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
https://arxiv.org/abs/2309.12121
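The key structural idea is a bank of parallel band-limited branches operating at different rates and scales. The PyTorch sketch below shows such a multi-branch encoder with arbitrary kernel/stride choices; the paper's actual band design follows the Constant-Q transform rather than these placeholder values.

```python
import torch
import torch.nn as nn

class MultiscaleEncoder(nn.Module):
    """Parallel Conv1d branches with different strides and kernel widths,
    so each branch analyses the waveform at a different rate and scale."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(1, dim, kernel_size=k, stride=s, padding=k // 2)
            for k, s in [(16, 8), (64, 32), (256, 128)]
        ])

    def forward(self, wav: torch.Tensor) -> list[torch.Tensor]:
        # wav: (batch, 1, samples) -> one embedding sequence per scale
        return [branch(wav) for branch in self.branches]

embeddings = MultiscaleEncoder()(torch.randn(4, 1, 16000))
print([e.shape for e in embeddings])
```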
In recent years, face recognition systems have been brought to the mainstream due to developments in hardware and software, and consistent efforts are being made to make them better and more secure. This has also brought rapid development in 3D face recognition (3DFR) systems, which are expected to overcome certain vulnerabilities of 2D face recognition (2DFR) systems. One such problem that the domain of 2DFR systems faces is face image morphing. A substantial amount of research has addressed the generation of high-quality face morphs along with the detection of attacks using these morphs. By comparison, the vulnerability of 3DFR systems to 3D face morphs is less well understood, yet 3DFR systems are expected to be more robust against such attacks. This paper attempts to research and gain more information on this matter. It describes a couple of methods that can be used to generate 3D face morphs; the face morphs generated using these methods are then compared to the contributing faces to obtain similarity scores. The highest MMPMR obtained is around 40%, with an RMMR of 41.76%, when 3DFR systems are attacked with look-alike morphs.
https://arxiv.org/abs/2309.12118
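The MMPMR figure quoted above measures how often a morph fools the matcher for all of its contributing subjects. A simplified single-probe-per-subject form is sketched below; variants of the definition also account for multiple probe attempts per subject.

```python
import numpy as np

def mmpmr(scores: np.ndarray, threshold: float) -> float:
    """Mated Morph Presentation Match Rate: fraction of morphs whose
    match score against *every* contributing subject clears the threshold.
    scores: (n_morphs, n_contributing_subjects) similarity matrix."""
    return float(np.mean(scores.min(axis=1) > threshold))

# toy usage: 5 morphs, each built from 2 contributing faces
scores = np.array([[0.71, 0.64], [0.55, 0.80], [0.90, 0.88],
                   [0.40, 0.95], [0.77, 0.69]])
print(mmpmr(scores, threshold=0.6))   # 0.6 -> 3 of 5 morphs fool both matches
```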