Neural machine translation (NMT) has shown impressive performance when trained on large-scale corpora. However, generic NMT systems have demonstrated poor performance on out-of-domain translation. To mitigate this issue, several domain adaptation methods have recently been proposed, which often lead to better translation quality than generic NMT systems. While there has been continuous progress in NMT for English and other European languages, domain adaptation in Arabic has received little attention in the literature. The current study, therefore, aims to explore the effectiveness of domain-specific adaptation for Arabic MT (AMT) in a yet unexplored domain: financial news articles. To this end, we carefully developed a parallel corpus for Arabic-English (AR-EN) translation in the financial domain to benchmark different domain adaptation methods. We then fine-tuned several pre-trained NMT and large language models, including ChatGPT-3.5 Turbo, on our dataset. The results showed that fine-tuning is successful using just a few well-aligned in-domain AR-EN segments. The quality of ChatGPT translation was superior to that of the other models according to automatic and human evaluations. To the best of our knowledge, this is the first work on fine-tuning ChatGPT for financial-domain transfer learning. To contribute to research in domain translation, we made our datasets and fine-tuned models available at this https URL.
https://arxiv.org/abs/2309.12863
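Since the abstract turns on fine-tuning a pre-trained NMT model with only a few in-domain segments, a minimal sketch of that workflow may be useful; the base checkpoint, hyperparameters, and toy sentence pair below are illustrative assumptions, not the paper's setup.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-ar-en"  # assumed AR-EN base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy in-domain pair standing in for the financial-news corpus.
pairs = [("ارتفعت أسعار النفط اليوم", "Oil prices rose today")]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for src, tgt in pairs:
        batch = tokenizer(src, text_target=tgt, return_tensors="pt")
        loss = model(**batch).loss  # cross-entropy over target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```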
Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need for hand-crafted audio features, we employed a self-supervised feature extractor based on the Wav2Vec model to capture acoustic features from audio data. The output feature maps of the preprocessing step are then fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Using the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, i.e., a support vector machine classifier and transfer learning of a pretrained CNN. Comparing the proposed method to state-of-the-art methods on the SER task indicates its superiority. Our findings underscore the pivotal role of deep unsupervised feature learning in elevating the landscape of SER, offering enhanced emotional comprehension in the realm of human-computer interactions.
https://arxiv.org/abs/2309.12714
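As the pipeline is a frozen self-supervised extractor followed by a CNN classifier, a rough sketch of that two-stage design follows; the checkpoint, layer sizes, and six-class output (matching ShEMO's emotion labels) are assumptions.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

extractor = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
extractor.eval()  # frozen extractor replaces hand-crafted features

class EmotionCNN(nn.Module):
    def __init__(self, n_classes: int = 6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(768, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):  # x: (batch, time, 768) feature maps
        h = self.conv(x.transpose(1, 2)).squeeze(-1)
        return self.fc(h)

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio (dummy)
with torch.no_grad():
    feats = extractor(waveform).last_hidden_state
logits = EmotionCNN()(feats)
```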
Brain tumors are collections of abnormal cells that can develop into masses or clusters. Because they have the potential to infiltrate other tissues, they pose a risk to the patient. MRI, the main imaging technique used, may be able to identify brain tumors accurately. The fast development of deep learning methods for computer vision applications has been facilitated by a vast amount of training data and by improvements in model construction that offer better approximations in a supervised setting; the need for these approaches has been the main driver of this expansion. Deep learning methods have shown promise in improving the precision of brain tumor detection and classification using magnetic resonance imaging (MRI). This abstract presents a study on the use of deep learning techniques, especially ResNet50, for brain tumor identification; accordingly, the study investigates the possibility of automating the detection procedure using deep learning techniques. In this study, I utilized five transfer learning models, namely VGG16, VGG19, DenseNet121, ResNet50, and YOLO v4, of which ResNet50 provided the highest accuracy at 99.54%. The goal of the study is to guide researchers and medical professionals toward powerful brain tumor detection systems by way of this evaluation and analysis of deep learning approaches.
https://arxiv.org/abs/2309.12193
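A minimal sketch of the ResNet50 transfer-learning setup described above, assuming a frozen ImageNet backbone and a replaced classification head; the class count is a placeholder since the abstract does not list the tumor categories.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 4)  # trainable head; 4 classes assumed
```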
Pneumonia is the leading infectious cause of infant death in the world. When it is identified early, it is possible to alter the patient's prognosis, and imaging exams can help in the diagnostic confirmation. Performing and interpreting the exams as soon as possible is vital for good treatment, with the most common exam for this pathology being the chest X-ray. The objective of this study was to develop software that identifies the presence or absence of pneumonia in chest radiographs. The software was developed as a computational model based on machine learning using the transfer learning technique. For the training process, images were collected from an online database of children's chest X-ray images taken at a hospital in China. After training, the model was exposed to new images, achieving relevant results in identifying the pathology, reaching 98% sensitivity and 97.3% specificity on the test sample. It can be concluded that it is possible to develop software that identifies pneumonia in chest X-ray images.
https://arxiv.org/abs/2309.11995
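The abstract reports results as sensitivity and specificity; for reference, a minimal sketch of how those are derived from a confusion matrix over test predictions (the labels here are dummies).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]  # 1 = pneumonia, 0 = normal (dummy labels)
y_pred = [1, 0, 1, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
```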
Human-Computer Interaction (HCI) has been the subject of research for many years, and recent studies have focused on improving its performance through various techniques. In the past decade, deep learning studies have shown high performance in various research areas, leading researchers to explore their application to HCI. Convolutional neural networks can be used to recognize hand gestures from images using deep architectures. In this study, we evaluated pre-trained high-performance deep architectures on the HG14 dataset, which consists of 14 different hand gesture classes. Among 22 different models, versions of the VGGNet and MobileNet models attained the highest accuracy rates. Specifically, the VGG16 and VGG19 models achieved accuracy rates of 94.64% and 94.36%, respectively, while the MobileNet and MobileNetV2 models achieved accuracy rates of 96.79% and 94.43%, respectively. We performed hand gesture recognition on the dataset using an ensemble learning technique, which combined the four most successful models. By utilizing these models as base learners and applying the Dirichlet ensemble technique, we achieved an accuracy rate of 98.88%. These results demonstrate the effectiveness of the deep ensemble learning technique for HCI and its potential applications in areas such as augmented reality, virtual reality, and game technologies.
https://arxiv.org/abs/2309.11610
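A rough sketch of the Dirichlet ensemble idea: sample convex weight vectors over the four base learners and keep the weights that score best on validation data. The search loop and scoring are our illustration, not necessarily the authors' exact procedure.

```python
import numpy as np

def dirichlet_ensemble(probs, y_val, n_trials=1000, seed=0):
    """probs: (n_models, n_samples, n_classes) validation predictions;
    y_val: (n_samples,) integer labels."""
    rng = np.random.default_rng(seed)
    best_w, best_acc = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(probs.shape[0]))  # convex model weights
        fused = np.tensordot(w, probs, axes=1)      # weighted average
        acc = (fused.argmax(-1) == y_val).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```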
We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relatively short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which results in performance improvements. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves state-of-the-art performance.
https://arxiv.org/abs/2309.11445
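A rough sketch of the two-stage paradigm: per-person skeleton modeling followed by stacked Transformer encoders over person tokens. The graph convolution is reduced to a linear stand-in and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SkeleTRSketch(nn.Module):
    def __init__(self, joint_dim=3, embed_dim=128, n_classes=60):
        super().__init__()
        self.gcn = nn.Linear(joint_dim, embed_dim)  # stand-in for graph convs
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.interaction = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, skeletons):
        # skeletons: (batch, persons, frames, joints, joint_dim)
        tokens = self.gcn(skeletons).mean(dim=(2, 3))  # one token per person
        fused = self.interaction(tokens)               # person interactions
        return self.head(fused.mean(dim=1))            # video-level logits

logits = SkeleTRSketch()(torch.randn(2, 5, 16, 17, 3))
```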
Knitting patterns are a crucial component in the creation and design of knitted materials. Traditionally, these patterns were taught informally, but thanks to advancements in technology, anyone interested in knitting can use the patterns as a guide to start knitting. Perhaps because knitting is mostly a hobby, with the exception of industrial manufacturing utilising specialised knitting machines, the use of AI in knitting is less widespread than its application in other fields. However, it is important to determine whether knitted pattern classification using an automated system is viable. In order to recognise and classify knitting patterns, this study proposes a deep learning model that uses data augmentation and a transfer learning technique. The Inception ResNet-V2 is the main feature extraction and classification algorithm used in the model. Metrics like accuracy, logarithmic loss, F1-score, precision, and recall were used to evaluate the model. The evaluation's findings demonstrate high model accuracy, precision, recall, and F1-score. In addition, the AUC score for the majority of the classes was in the range 0.7-0.9. A comparative analysis was done using other pretrained models and a ResNet-50 model with transfer learning, and the proposed model's evaluation results surpassed all others. The major limitation of this project is time: with more time, better accuracy might have been achieved over a larger number of epochs.
https://arxiv.org/abs/2309.11202
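A minimal sketch of the augmentation-plus-transfer-learning recipe around Inception ResNet-V2; the timm checkpoint, the specific transforms, and the class count are assumptions.

```python
import timm
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])

# num_classes swaps in a fresh head for the knitting-pattern classes (count assumed).
model = timm.create_model("inception_resnet_v2", pretrained=True, num_classes=10)
```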
Pathology diagnosis based on EEG signals and decoding brain activity holds immense importance in understanding neurological disorders. With the advancement of artificial intelligence methods and machine learning techniques, the potential for accurate data-driven diagnoses and effective treatments has grown significantly. However, applying machine learning algorithms to real-world datasets presents diverse challenges at multiple levels. The scarcity of labelled data, especially in low-data regimes where real patient cohorts are of limited availability due to the high costs of recruitment, underscores the vital role of scaling and transfer learning techniques. In this study, we explore a real-world pathology classification task to highlight the effectiveness of data and model scaling and cross-dataset knowledge transfer. We observe varying performance improvements through data scaling, indicating the need for careful evaluation and labelling. Additionally, we identify the challenge of possible negative transfer and emphasize the significance of some key components for overcoming distribution shifts and potential spurious correlations and achieving positive transfer. We observe improved performance of the target model on the target (NMT) dataset by using knowledge from the source dataset (TUAB) when only a small amount of labelled data is available. Our findings indicate that a small and generic model (e.g. ShallowNet) performs well on a single dataset, whereas a larger model (e.g. TCN) performs better at transferring and learning from a larger and more diverse dataset.
https://arxiv.org/abs/2309.10910
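A minimal sketch of the cross-dataset transfer setup under a low-label regime: freeze the feature extractor learned on the source (TUAB) and fine-tune the rest on the target (NMT). The architecture is a generic stand-in, and the 21-channel montage is an assumption.

```python
import torch
import torch.nn as nn

# Stand-in for a model pre-trained on the source dataset (TUAB).
model = nn.Sequential(
    nn.Conv1d(21, 32, kernel_size=25), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 2),  # normal vs. abnormal
)

for p in model[0].parameters():  # keep the source feature extractor fixed
    p.requires_grad = False

# Fine-tune only the remaining parameters on the few labelled target samples.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```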
In this work, we explore the influence of entropy change in deep learning systems by adding noise to the inputs/latent features. The applications in this paper focus on deep learning tasks within computer vision, but the proposed theory can be further applied to other fields. Noise is conventionally viewed as a harmful perturbation in various deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers (ViTs), as well as in different learning tasks like image classification and transfer learning. However, this paper aims to rethink whether the conventional proposition always holds. We demonstrate that specific noise can boost the performance of various deep architectures under certain conditions. We theoretically prove the enhancement gained from positive noise by reducing the task complexity defined by information entropy, and experimentally show the significant performance gain on large image datasets such as ImageNet. Herein, we use information entropy to define the complexity of the task. We categorize noise into two types, positive noise (PN) and harmful noise (HN), based on whether the noise can help reduce the complexity of the task. Extensive experiments with CNNs and ViTs have shown performance improvements from proactively injecting positive noise, where we achieved an unprecedented top-1 accuracy of over 95% on ImageNet. Both theoretical analysis and empirical evidence confirm that the presence of positive noise can benefit the learning process, while the traditionally perceived harmful noise indeed impairs deep learning models. The different roles of noise offer new explanations for deep models on specific tasks and provide a new paradigm for improving model performance. Moreover, they remind us that we can influence the performance of learning systems via information entropy change.
https://arxiv.org/abs/2309.10625
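A minimal sketch of injecting noise into latent features during training, in the spirit of the paper's positive-noise idea; the Gaussian form and scale are illustrative, and whether a given noise is positive or harmful is, per the paper, task-dependent.

```python
import torch
import torch.nn as nn

class NoisyLatent(nn.Module):
    """Adds noise to latent features during training only."""
    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if self.training:  # inject noise only in training mode
            return x + self.sigma * torch.randn_like(x)
        return x
```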
Reinforcement learning has been increasingly applied in monitoring applications because of its ability to learn from previous experiences and make adaptive decisions. However, existing machine-learning-based health monitoring applications mostly use supervised learning algorithms, which are trained on labels and cannot make adaptive decisions in an uncertain, complex environment. This study proposes a novel and generic system, predictive deep reinforcement learning (PDRL), with multiple RL agents in a time-series forecasting environment. The proposed generic framework accommodates virtual Deep Q Network (DQN) agents that monitor predicted future states of a complex environment with a well-defined reward policy, so that each agent learns existing knowledge while maximizing its rewards. In the evaluation of the framework, three DRL agents were deployed to monitor a subject's future heart rate, respiration, and temperature as predicted by a BiLSTM model. With each iteration, the three agents learned the associated patterns, and their cumulative rewards gradually increased. The framework outperformed the baseline models for all three monitoring agents and achieves state-of-the-art performance in the time-series forecasting process. The DRL agents and the deep learning model in the PDRL framework can be customized to implement transfer learning in other forecasting applications, such as traffic and weather, and to monitor their states. There, the PDRL framework is able to learn the future states of traffic and weather forecasting, and the cumulative rewards gradually increase over each episode.
https://arxiv.org/abs/2309.10576
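Since the framework hinges on a well-defined reward policy for agents watching predicted vitals, a minimal sketch of one such policy follows; the thresholds and the alert/no-alert action space are illustrative assumptions.

```python
def reward(predicted_hr: float, action: int) -> float:
    """Reward a DQN-style monitoring agent for correct alert decisions."""
    abnormal = predicted_hr < 50 or predicted_hr > 120  # assumed thresholds
    alerted = action == 1                               # 1 = raise alert
    return 1.0 if alerted == abnormal else -1.0
```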
Federated learning (FL) enables edge nodes to collaboratively contribute to constructing a global model without sharing their data. This is accomplished by devices computing local, private model updates that are then aggregated by a server. However, computational resource constraints and network communication can become a severe bottleneck for the larger model sizes typical of deep learning applications. Edge nodes tend to have limited hardware resources (RAM, CPU), and network bandwidth and reliability at the edge are a concern for scaling federated fleet applications. In this paper, we propose and evaluate an FL strategy inspired by transfer learning in order to reduce resource utilization on devices, as well as the load on the server and network, in each global training round. For each local model update, we randomly select layers to train, freezing the remaining part of the model. In doing so, we can reduce both server load and communication costs per round by excluding all untrained layer weights from being transferred to the server. The goal of this study is to empirically explore the potential trade-off between resource utilization on devices and global model convergence under the proposed strategy. We implement the approach using the federated learning framework FEDn. A number of experiments were carried out over different datasets (CIFAR-10, CASA, and IMDB), performing different tasks using different deep learning model architectures. Our results show that training the model partially can accelerate the training process, efficiently utilize on-device resources, and reduce data transmission by around 75% and 53% when we train 25% and 50% of the model layers, respectively, without harming the resulting global model accuracy.
https://arxiv.org/abs/2309.10367
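A minimal, framework-agnostic sketch of the per-round strategy: randomly choose a fraction of layers to train, freeze the rest, and transmit only the trained weights. FEDn-specific wiring is omitted.

```python
import random
import torch.nn as nn

def select_and_freeze(model: nn.Module, fraction: float = 0.25, seed=None):
    """Randomly pick a fraction of top-level layers to train; freeze the rest."""
    names = [name for name, _ in model.named_children()]
    rng = random.Random(seed)
    chosen = set(rng.sample(names, max(1, int(fraction * len(names)))))
    for name, child in model.named_children():
        for p in child.parameters():
            p.requires_grad = name in chosen
    return chosen

def trainable_update(model: nn.Module, chosen: set):
    """Keep only the trained layers' weights for transfer to the server."""
    return {k: v for k, v in model.state_dict().items()
            if k.split(".")[0] in chosen}
```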
In this paper, we explored how to boost speech emotion recognition (SER) with the state-of-the-art speech pre-trained model (PTM) data2vec, the text generation technique GPT-4, and the speech synthesis technique Azure TTS. First, we investigated the representation ability of different speech self-supervised pre-trained models, and we found that data2vec has good representation ability on the SER task. Second, we employed a powerful large language model (LLM), GPT-4, and an emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech. We carefully designed the text prompts and dataset construction to obtain synthetic emotional speech data of high quality. Third, we studied different ways of using data augmentation to promote the SER task with synthetic speech, including random mixing, adversarial training, transfer learning, and curriculum learning. Experiments and ablation studies on the IEMOCAP dataset demonstrate the effectiveness of our method compared with other data augmentation methods and with data augmentation using other synthetic data.
https://arxiv.org/abs/2309.10294
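A minimal sketch of the random-mixing augmentation: each batch combines real IEMOCAP utterances with synthetic ones at a sampled ratio. Data loading is stubbed out, and the mixing scheme is our reading of the abstract, not the paper's exact recipe.

```python
import random

def mixed_batch(real, synthetic, batch_size=32, synth_ratio=None):
    """Sample a batch mixing real and synthetic utterances."""
    r = synth_ratio if synth_ratio is not None else random.random()
    n_synth = int(r * batch_size)
    return (random.sample(synthetic, n_synth)
            + random.sample(real, batch_size - n_synth))
```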
Semantic segmentation (classification) of Earth Observation imagery is a crucial task in remote sensing. This paper presents a comprehensive review of technical factors to consider when designing neural networks for this purpose. The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and transformer models, discussing prominent design patterns for these ANN families and their implications for semantic segmentation. Common pre-processing techniques for ensuring optimal data preparation are also covered. These include methods for image normalization and chipping, as well as strategies for addressing data imbalance in training samples, and techniques for overcoming limited data, including augmentation techniques, transfer learning, and domain adaptation. By encompassing both the technical aspects of neural network design and the data-related considerations, this review provides researchers and practitioners with a comprehensive and up-to-date understanding of the factors involved in designing effective neural networks for semantic segmentation of Earth Observation imagery.
https://arxiv.org/abs/2308.09221
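For the preprocessing techniques the review covers, a minimal sketch of per-band normalization and chipping of a large scene into fixed-size tiles; shapes and tile size are illustrative.

```python
import numpy as np

def normalize(scene):  # scene: (bands, H, W)
    mean = scene.mean(axis=(1, 2), keepdims=True)
    std = scene.std(axis=(1, 2), keepdims=True)
    return (scene - mean) / (std + 1e-8)

def chip(scene, size=256):
    """Split a scene into non-overlapping size x size chips."""
    _, h, w = scene.shape
    return [scene[:, r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

chips = chip(normalize(np.random.rand(4, 1024, 1024)))
```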
A common practice in metric learning is to train and test an embedding model for each dataset. This dataset-specific approach fails to simulate real-world scenarios that involve multiple heterogeneous distributions of data. In this regard, we introduce a novel metric learning paradigm, called Universal Metric Learning (UML), which learns a unified distance metric capable of capturing relations across multiple data distributions. UML presents new challenges, such as imbalanced data distribution and bias towards dominant distributions. To address these challenges, we propose Parameter-efficient Universal Metric leArning (PUMA), which consists of a pre-trained frozen model and two additional modules: a stochastic adapter and a prompt pool. These modules enable capturing dataset-specific knowledge while avoiding bias towards dominant distributions. Additionally, we compile a new universal metric learning benchmark with a total of 8 different datasets. PUMA outperforms the state-of-the-art dataset-specific models while using about 69 times fewer trainable parameters.
https://arxiv.org/abs/2309.08944
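A rough sketch of the parameter-efficient idea behind PUMA: a small residual adapter trained on top of a frozen backbone. Approximating the "stochastic" behavior with dropout inside the adapter is our assumption, and the prompt pool is omitted.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter added to a frozen model."""
    def __init__(self, dim=768, bottleneck=64, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(),
            nn.Dropout(p), nn.Linear(bottleneck, dim),
        )

    def forward(self, x):
        return x + self.net(x)  # residual keeps the frozen features usable
```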
In modern commercial search engines and recommendation systems, data from multiple domains is available to jointly train a multi-domain model. Traditional methods train multi-domain models in the multi-task setting, with shared parameters to learn the similarity of multiple tasks, and task-specific parameters to learn the divergence of features, labels, and sample distributions of individual tasks. With the development of large language models, LLMs can extract global domain-invariant text features that serve both search and recommendation tasks. We propose a novel framework called S&R Multi-Domain Foundation, which uses an LLM to extract domain-invariant features and Aspect Gating Fusion to merge the ID feature, domain-invariant text features, and task-specific heterogeneous sparse features to obtain the representations of query and item. Additionally, samples from multiple search and recommendation scenarios are trained jointly with a Domain Adaptive Multi-Task module to obtain the multi-domain foundation model. We apply the S&R Multi-Domain Foundation model to cold-start scenarios in the pretrain-finetune manner, which achieves better performance than other SOTA transfer learning methods. The S&R Multi-Domain Foundation model has been successfully deployed in the Alipay Mobile Application's online services, such as content query recommendation and service card recommendation.
https://arxiv.org/abs/2309.08939
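A rough sketch of gating-based fusion in the spirit of Aspect Gating Fusion: learn per-source weights over the ID, domain-invariant text, and sparse-feature embeddings. Dimensions and the gating form are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=128, n_sources=3):
        super().__init__()
        self.gate = nn.Linear(n_sources * dim, n_sources)

    def forward(self, id_emb, text_emb, sparse_emb):
        stacked = torch.stack([id_emb, text_emb, sparse_emb], dim=1)  # (B, 3, D)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # fused (B, D)
```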
Breast cancer is one of the leading causes of death for women worldwide. Early screening is essential for early identification, as the chance of survival declines once the cancer progresses into advanced stages. For this study, the most recent BRACS dataset of histological (H&E) stained images was used to classify breast cancer tumours. It contains both whole-slide images (WSI) and region-of-interest (ROI) images; for our study we considered the ROI images. We experimented with different pre-trained deep learning models, such as Xception, EfficientNet, ResNet50, and InceptionResNet, pre-trained on ImageNet weights. We pre-processed the BRACS ROI images and applied image augmentation, upsampling, and dataset split strategies. For the default dataset split, the best results were obtained by ResNet50, achieving a 66% F1-score. For the custom dataset split, the best results were obtained by performing upsampling and image augmentation, which resulted in a 96.2% F1-score. Our second approach also reduced the number of false positive and false negative classifications to less than 3% for each class. We believe that our study significantly impacts the early diagnosis and identification of breast cancer tumors and their subtypes, especially atypical and malignant tumors, thus improving patient outcomes and reducing patient mortality rates. Overall, this study has primarily focused on identifying seven (7) breast cancer tumor subtypes, and we believe that the experimental models can be fine-tuned further to generalize over previous breast cancer histology datasets as well.
https://arxiv.org/abs/2309.08745
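A minimal sketch of the upsampling step: resample minority tumor subtypes to the majority count before augmentation. Using sklearn's resample is our choice; the paper's exact procedure is not specified in the abstract.

```python
from collections import Counter
from sklearn.utils import resample

def upsample(images, labels, seed=0):
    """Replicate minority-class samples up to the majority-class count."""
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        cls_x = [x for x, y in zip(images, labels) if y == cls]
        boosted = resample(cls_x, replace=True, n_samples=target,
                           random_state=seed)
        out_x += boosted
        out_y += [cls] * target
    return out_x, out_y
```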
Accurate drone detection is strongly desired in drone collision avoidance, drone defense and autonomous Unmanned Aerial Vehicle (UAV) self-landing. With the recent emergence of the Vision Transformer (ViT), this critical task is reassessed in this paper using a UAV dataset composed of 1359 drone photos. We construct various CNN and ViT-based models, demonstrating that for single-drone detection, a basic ViT can achieve performance 4.6 times more robust than our best CNN-based transfer learning models. By implementing the state-of-the-art You Only Look Once (YOLO v7, 200 epochs) and the experimental ViT-based You Only Look At One Sequence (YOLOS, 20 epochs) in multi-drone detection, we attain impressive 98% and 96% mAP values, respectively. We find that ViT outperforms CNN at the same epoch, but also requires more training data, computational power, and sophisticated, performance-oriented designs to fully surpass the capabilities of cutting-edge CNN detectors. We summarize the distinct characteristics of ViT and CNN models to aid future researchers in developing more efficient deep learning models.
https://arxiv.org/abs/2308.09899
Label-free cell classification is advantageous for supplying pristine cells for further use or examination, yet existing techniques frequently fall short in terms of specificity and speed. In this study, we address these limitations through the development of a novel machine learning framework, Multiplex Image Machine Learning (MIML). This architecture uniquely combines label-free cell images with biomechanical property data, harnessing the vast, often underutilized morphological information intrinsic to each cell. By integrating both types of data, our model offers a more holistic understanding of the cellular properties, utilizing morphological information typically discarded in traditional machine learning models. This approach has led to a remarkable 98.3\% accuracy in cell classification, a substantial improvement over models that only consider a single data type. MIML has been proven effective in classifying white blood cells and tumor cells, with potential for broader application due to its inherent flexibility and transfer learning capability. It's particularly effective for cells with similar morphology but distinct biomechanical properties. This innovative approach has significant implications across various fields, from advancing disease diagnostics to understanding cellular behavior.
https://arxiv.org/abs/2309.08421
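A rough sketch of MIML's core idea: concatenating learned image features with a biomechanical property vector before classification. All sizes and the single-channel image input are illustrative.

```python
import torch
import torch.nn as nn

class MIMLSketch(nn.Module):
    def __init__(self, img_dim=256, mech_dim=8, n_classes=2):
        super().__init__()
        self.img_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 16, img_dim), nn.ReLU(),
        )
        self.head = nn.Linear(img_dim + mech_dim, n_classes)

    def forward(self, image, mechanics):
        feats = torch.cat([self.img_net(image), mechanics], dim=-1)
        return self.head(feats)

logits = MIMLSketch()(torch.randn(4, 1, 64, 64), torch.randn(4, 8))
```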
Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial contents, but naively transferring such models to video recognition still suffers from unsatisfactory temporal modeling capabilities. Existing methods insert tunable structures into or in parallel with the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder, and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters. Meanwhile, we empirically show that disentangled learning with an extra network for integration benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Codes and models can be found in this https URL.
https://arxiv.org/abs/2309.07911
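A rough sketch of DiST's dual-encoder layout: a frozen spatial encoder, a lightweight trainable temporal encoder, and an integration branch fusing the two streams. The modules are simple stand-ins (a linear layer for the frozen foundation model, a GRU for the temporal network).

```python
import torch
import torch.nn as nn

class DiSTSketch(nn.Module):
    def __init__(self, spatial_dim=768, temporal_dim=128, n_classes=400):
        super().__init__()
        self.spatial = nn.Linear(3 * 224 * 224, spatial_dim)  # frozen stand-in
        for p in self.spatial.parameters():
            p.requires_grad = False
        self.temporal = nn.GRU(spatial_dim, temporal_dim, batch_first=True)
        self.integrate = nn.Linear(spatial_dim + temporal_dim, n_classes)

    def forward(self, frames):  # frames: (B, T, 3, 224, 224)
        s = self.spatial(frames.flatten(2))           # per-frame spatial features
        h, _ = self.temporal(s)                       # lightweight temporal stream
        fused = torch.cat([s.mean(1), h[:, -1]], -1)  # integration branch
        return self.integrate(fused)

logits = DiSTSketch()(torch.randn(2, 8, 3, 224, 224))
```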
A recent trend in deep learning has been toward training large-scale models with high parameter counts on big datasets. However, the robustness of such large-scale models in real-world settings is still a less-explored topic. In this work, we first benchmark the performance of these models under different perturbations and datasets representing real-world shifts, and highlight their degrading performance under these shifts. We then discuss how existing robustification schemes based on complete model fine-tuning might not be a scalable option for very large-scale networks and can also lead them to forget some of their desired characteristics. Finally, we propose a simple and cost-effective method to solve this problem, inspired by the knowledge transfer literature. It involves robustifying smaller models at a lower computation cost, and then using them as teachers to tune a fraction of these large-scale networks, reducing the overall computational overhead. We evaluate our proposed method under various vision perturbations, including the ImageNet-C, -R, -S, and -A datasets, as well as in transfer learning and zero-shot evaluation setups on different datasets. Benchmark results show that our method is able to induce robustness in these large-scale models efficiently, requiring significantly less time, and also preserves the transfer learning and zero-shot properties of the original model, which none of the existing methods achieve.
https://arxiv.org/abs/2309.07499
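A minimal sketch of the distillation step at the heart of the proposal: a small, already-robustified teacher guides updates to a fraction of the large model via a temperature-scaled KL loss. The loss form is standard knowledge distillation; the paper's exact objective may differ.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T: float = 2.0):
    """Temperature-scaled KL divergence between student and robust teacher."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
```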