Precise crop yield prediction is essential for improving agricultural practices and ensuring crop resilience in varying climates. Integrating weather data across the growing season, especially for different crop varieties, is crucial for understanding their adaptability in the face of climate change. In the MLCAS2021 Crop Yield Prediction Challenge, we utilized a dataset comprising 93,028 training records to forecast yields for 10,337 test records, covering 159 locations across 28 U.S. states and Canadian provinces over 13 years (2003-2015). This dataset included details on 5,838 distinct genotypes and daily weather data for a 214-day growing season, enabling comprehensive analysis. As one of the winning teams, we developed two novel convolutional neural network (CNN) architectures: the CNN-DNN model, combining CNN and fully-connected networks, and the CNN-LSTM-DNN model, which adds an LSTM layer for the weather variables. Leveraging the Generalized Ensemble Method (GEM), we determined optimal model weights, resulting in superior performance compared to baseline models. Evaluated on test data, the GEM model achieved 5.55% to 39.88% lower RMSE, 5.34% to 43.76% lower MAE, and 1.1% to 10.79% higher correlation coefficients than the baselines. We applied the CNN-DNN model to identify top-performing genotypes for various locations and weather conditions, aiding genotype selection based on weather variables. Our data-driven approach is valuable for scenarios with limited testing years. Additionally, a feature importance analysis using RMSE change highlighted the significance of location, maturity group (MG), year, and genotype, along with the importance of the weather variables MDNI and AP.
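To make the ensembling step concrete, here is a minimal sketch of how GEM-style weights could be found: a convex combination of the base models' predictions whose weights minimize validation RMSE. The array names (`val_preds`, `y_val`, `test_preds`) are hypothetical, and the paper's exact optimization setup may differ.

```python
import numpy as np
from scipy.optimize import minimize

def gem_weights(val_preds, y_val):
    """Find convex-combination weights minimizing validation RMSE.

    val_preds: (n_models, n_samples) predictions of each base model.
    y_val:     (n_samples,) ground-truth yields.
    """
    n_models = val_preds.shape[0]

    def rmse(w):
        return np.sqrt(np.mean((w @ val_preds - y_val) ** 2))

    res = minimize(
        rmse,
        np.full(n_models, 1.0 / n_models),   # start from equal weights
        bounds=[(0.0, 1.0)] * n_models,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x

# usage: w = gem_weights(val_preds, y_val); y_test_hat = w @ test_preds
```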
https://arxiv.org/abs/2309.13021
The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To standardize the acquisition, interpretation and usage of these complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones: PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with every other zone, nor is every zone present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from the different branches act complementarily to each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (measured by mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for the PZ, TZ, DPU and AFS zones respectively.
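As a rough illustration of the second-stage objective, the sketch below penalizes disagreement between the two branches' per-class probability maps for the same input. The paper's exact penalty may be formulated differently (e.g., over logits or selected classes), so treat this as an assumption-laden reading.

```python
import torch
import torch.nn.functional as F

def branch_consistency_loss(logits_a: torch.Tensor,
                            logits_b: torch.Tensor) -> torch.Tensor:
    """Unsupervised loss: penalize the difference between the two
    branches' per-class predictions (no labels required)."""
    p_a = F.softmax(logits_a, dim=1)  # (B, C, H, W) probability maps
    p_b = F.softmax(logits_b, dim=1)
    return F.mse_loss(p_a, p_b)
```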
https://arxiv.org/abs/2309.12970
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and generalization performance on long-form test sets. Despite speculation that, as model size increases, CTC can be as good as RNN-T, which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant of severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
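A minimal sketch of the time-reduction idea: strided pooling over the encoder's time axis, so stacking a few such layers cuts the frame rate multiplicatively. The kernel size, stride, and placement within the USM encoder are assumptions here.

```python
import torch
import torch.nn as nn

class TimeReduction(nn.Module):
    """Strided average pooling over the time axis to cut the frame rate.

    Stacking e.g. three stride-2 layers gives an 8x reduction."""
    def __init__(self, stride: int = 2):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features) -> pool over the time dimension
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(4, 100, 256)
print(TimeReduction(2)(x).shape)  # torch.Size([4, 50, 256])
```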
https://arxiv.org/abs/2309.12963
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged that generates a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two striking experimental observations on the object localization learning process. For a trained network, as the foreground mask expands: 1) the cross-entropy converges to zero while the foreground mask covers only part of the object region; 2) the activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve more effective localization, we argue for using the activation value to learn more of the object region. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of the generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and an area constraint, BAS can learn the whole region of the object. In the inference phase, we combine the prediction maps of different categories to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at this https URL.
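The sketch below illustrates one plausible reading of the AMC objective's two terms: suppressing activation outside the generated foreground mask while constraining the mask's area. Tensor shapes and the exact normalization are assumptions, not the paper's verbatim formulation.

```python
import torch

def bas_terms(activation_map: torch.Tensor, fg_mask: torch.Tensor):
    """activation_map: (B, 1, H, W) class activation from the classifier head.
    fg_mask:        (B, 1, H, W) generator output in [0, 1]."""
    bg_activation = (activation_map * (1.0 - fg_mask)).mean()  # suppress background
    area = fg_mask.mean()                                      # keep the mask compact
    return bg_activation, area

# a combined loss might weight the two terms, e.g.:
# loss = bg_activation + lambda_area * area
```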
https://arxiv.org/abs/2309.12943
Saliency maps have become one of the most widely used interpretability techniques for convolutional neural networks (CNNs) due to their simplicity and the quality of the insights they provide. However, there are still some doubts about whether these insights are a trustworthy representation of what CNNs use to come up with their predictions. This paper explores how rescuing the sign of the gradients from the saliency map can lead to a deeper understanding of multi-class classification problems. Using CNNs both pretrained and trained from scratch, we unveil that considering the sign, and the effect not only of the correct class but also the influence of the other classes, allows us to better identify the pixels of the image that the network is really focusing on. Furthermore, how occluding or altering those pixels is expected to affect the outcome also becomes clearer.
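To ground the idea, here is a minimal signed-saliency sketch: the raw (signed) gradient of a chosen class logit with respect to the input, rather than its absolute value. The paper's full method also weighs in the other classes; this shows only the core step.

```python
import torch

def signed_saliency(model, x: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Gradient of one class logit w.r.t. the input, keeping the sign
    instead of taking the usual absolute value."""
    x = x.clone().requires_grad_(True)
    logits = model(x)                  # (1, num_classes)
    logits[0, class_idx].backward()
    return x.grad.detach()             # positive values push the logit up

# comparing the target class against another class could then be, e.g.:
# diff = signed_saliency(model, x, target) - signed_saliency(model, x, other)
```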
https://arxiv.org/abs/2309.12913
Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
https://arxiv.org/abs/2309.12881
Sequential recommendation (SRS) has recently become the technical foundation of many applications, aiming to recommend the next item based on the user's historical interactions. However, sequential recommendation often faces the problem of data sparsity, which is widespread in recommender systems. Besides, most users interact with only a few items, yet existing SRS models often underperform for these users. Such a problem, named the long-tail user problem, is still to be resolved. Data augmentation is a natural way to alleviate these two problems, but existing approaches often need elaborate training strategies or are hindered by poor-quality generated interactions. To address these problems, we propose a Diffusion Augmentation for Sequential Recommendation (DiffuASR) for higher-quality generation. The dataset augmented by DiffuASR can be used to train sequential recommendation models directly, free from complex training procedures. To make the best of the generation ability of the diffusion model, we first propose a diffusion-based pseudo sequence generation framework to fill the gap between image and sequence generation. Then, a sequential U-Net is designed to adapt the diffusion noise prediction model U-Net to the discrete sequence generation task. At last, we develop two guide strategies to align the preferences of the generated and original sequences. To validate the proposed DiffuASR, we conduct extensive experiments on three real-world datasets with three sequential recommendation models. The experimental results illustrate the effectiveness of DiffuASR. As far as we know, DiffuASR is among the first to introduce the diffusion model to recommendation.
https://arxiv.org/abs/2309.12858
With the rapid advances in high-throughput sequencing technologies, the focus of survival analysis has shifted from examining clinical indicators to incorporating genomic profiles with pathological images. However, existing methods either directly adopt a straightforward fusion of pathological features and genomic profiles for survival prediction, or take genomic profiles as guidance to integrate the features of pathological images. The former would overlook intrinsic cross-modal correlations. The latter would discard pathological information irrelevant to gene expression. To address these issues, we present a Cross-Modal Translation and Alignment (CMTA) framework to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we construct two parallel encoder-decoder structures for multi-modal data to integrate intra-modal information and generate cross-modal representation. Taking the generated cross-modal representation to enhance and recalibrate intra-modal representation can significantly improve its discrimination for comprehensive survival analysis. To explore the intrinsic cross-modal correlations, we further design a cross-modal attention module as the information bridge between different modalities to perform cross-modal interactions and transfer complementary information. Our extensive experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
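As a toy illustration of the cross-modal attention bridge, pathology tokens can attend over genomic tokens (and vice versa) with a standard attention layer; the dimensions and token counts below are hypothetical.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

path_tokens = torch.randn(2, 196, 256)   # pathology-patch features
gene_tokens = torch.randn(2, 60, 256)    # genomic-profile features

# pathology queries attend over genomic keys/values, transferring
# complementary information across modalities
fused, _ = cross_attn(query=path_tokens, key=gene_tokens, value=gene_tokens)
```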
https://arxiv.org/abs/2309.12855
The representations of neural networks are often compared to those of biological systems by performing regression between the neural network responses and those measured from biological systems. Many different state-of-the-art deep neural networks yield similar neural predictions, but it remains unclear how to differentiate among models that perform equally well at predicting neural responses. To gain insight into this, we use a recent theoretical framework that relates the generalization error from regression to the spectral bias of the model activations and the alignment of the neural responses onto the learnable subspace of the model. We extend this theory to the case of regression between model activations and neural responses, and define geometrical properties describing the error embedding geometry. We test a large number of deep neural networks that predict visual cortical activity and show that there are multiple types of geometries that result in low neural prediction error as measured via regression. The work demonstrates that carefully decomposing representational metrics can provide interpretability of how models are capturing neural activity and points the way towards improved models of neural activity.
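For readers unfamiliar with the regression setup, a minimal sketch: ridge regression from model activations to recorded responses, with held-out R^2 serving as the neural prediction score. Random arrays stand in for real data; the paper's analysis additionally decomposes the error via the spectrum of the activations and the alignment of the responses onto the model's learnable subspace.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# X: model activations (n_stimuli, n_units); Y: neural responses (n_stimuli, n_neurons)
X = np.random.randn(500, 1024)
Y = np.random.randn(500, 50)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
reg = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, Y_tr)
print("neural prediction R^2:", reg.score(X_te, Y_te))
```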
https://arxiv.org/abs/2309.12821
As a common approach to learning English, reading comprehension primarily entails reading articles and answering related questions. However, because designing effective exercises is complex, students typically encounter standardized questions, making it challenging to match exercises to individual learners' reading comprehension ability. By leveraging the advanced capabilities offered by large language models, exemplified by ChatGPT, this paper presents a novel personalized support system for reading comprehension, referred to as ChatPRCS, based on the Zone of Proximal Development theory. ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation, among others, to enhance reading comprehension instruction. First, we develop a new algorithm that predicts learners' reading comprehension abilities from their historical data, which serves as the foundation for generating questions at an appropriate level of difficulty. Second, a series of new ChatGPT prompt patterns is proposed to address two key aspects of reading comprehension objectives: question generation and automated evaluation. These patterns further improve the quality of generated questions. Finally, by integrating personalized ability prediction and the reading comprehension prompt patterns, ChatPRCS is systematically validated through experiments. Empirical results demonstrate that it provides learners with high-quality reading comprehension questions that are broadly aligned with expert-crafted questions at a statistical level.
https://arxiv.org/abs/2309.12808
This research introduces an enhanced version of the multi-objective speech assessment model, called MOSA-Net+, by leveraging the acoustic features from large pre-trained weakly supervised models, namely Whisper, to create embedding features. The first part of this study investigates the correlation between the embedding features of Whisper and two self-supervised learning (SSL) models with subjective quality and intelligibility scores. The second part evaluates the effectiveness of Whisper in deploying a more robust speech assessment model. Third, the possibility of combining representations from Whisper and SSL models while deploying MOSA-Net+ is analyzed. The experimental results reveal that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than the embedding features of the SSL models, contributing to the more accurate prediction performance achieved by MOSA-Net+. Moreover, combining the embedding features from Whisper and the SSL models leads to only marginal improvement. Compared to MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023 and obtained the top-ranked performance.
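For context, extracting Whisper encoder states as embedding features might look like the following with the openai-whisper package; the file name is a placeholder, and MOSA-Net+ adds its own assessment layers on top of such features.

```python
import torch
import whisper  # openai-whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("utterance.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    # (1, n_frames, d_model) encoder states, usable as assessment features
    embedding = model.encoder(mel.unsqueeze(0))
```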
https://arxiv.org/abs/2309.12766
Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring semantic similarity, it is better to directly predict the similarity using a model fine-tuned for such a task. Using a model fine-tuned on the STS-B task from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity aligns better with our expectations for a robust semantic similarity measure than other approaches.
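A minimal sketch of the idea, using a publicly available cross-encoder fine-tuned on STS-B (the paper's exact checkpoint may differ):

```python
from sentence_transformers import CrossEncoder

# a cross-encoder fine-tuned on the STS-B regression task
model = CrossEncoder("cross-encoder/stsb-roberta-base")

score = model.predict([("A man is playing a guitar.",
                        "Someone plays an instrument.")])[0]
print(score)  # predicted similarity, roughly in [0, 1]
```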
https://arxiv.org/abs/2309.12697
Mixup is an effective data augmentation method that generates new augmented samples by aggregating linear combinations of different original samples. However, if there are noises or aberrant features in the original samples, Mixup may propagate them to the augmented samples, leading to over-sensitivity of the model to these outliers. To solve this problem, this paper proposes a new Mixup method called AMPLIFY. This method uses the Attention mechanism of the Transformer itself to reduce the influence of noises and aberrant values in the original samples on the prediction results, without adding trainable parameters and at very low computational cost, thereby avoiding the high resource consumption of common Mixup methods such as Sentence Mixup. The experimental results show that, at a smaller computational cost, AMPLIFY outperforms other Mixup methods in text classification tasks on 7 benchmark datasets, providing new ideas and new ways to further improve the performance of pre-trained models based on the Attention mechanism, such as BERT, ALBERT, RoBERTa, and GPT. Our code can be obtained at this https URL.
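One plausible reading of the parameter-free mixing step is sketched below: hidden states from an attention sublayer are mixed with a shuffled copy of the batch using a Beta-sampled coefficient. Where exactly AMPLIFY applies the mixing, and how labels are handled, follow the paper; this is only an illustrative assumption.

```python
import torch

def hidden_mixup(h: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Mix attention-sublayer outputs with a shuffled copy of the batch.

    h: (batch, seq_len, dim) hidden states; adds no trainable parameters."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(h.size(0), device=h.device)
    return lam * h + (1.0 - lam) * h[perm]
```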
https://arxiv.org/abs/2309.12689
Understanding trajectory diversity is a fundamental aspect of addressing practical traffic tasks. However, capturing the diversity of trajectories presents challenges, particularly with traditional machine learning and recurrent neural networks due to the requirement of large-scale parameters. The emerging Transformer technology, renowned for its parallel computation capabilities enabling the utilization of models with hundreds of millions of parameters, offers a promising solution. In this study, we apply the Transformer architecture to traffic tasks, aiming to learn the diversity of trajectories within vehicle populations. We analyze the Transformer's attention mechanism and its adaptability to the goals of traffic tasks, and subsequently, design specific pre-training tasks. To achieve this, we create a data structure tailored to the attention mechanism and introduce a set of noises that correspond to spatio-temporal demands, which are incorporated into the structured data during the pre-training process. The designed pre-training model demonstrates excellent performance in capturing the spatial distribution of the vehicle population, with no instances of vehicle overlap and an RMSE of 0.6059 when compared to the ground truth values. In the context of time series prediction, approximately 95% of the predicted trajectories' speeds closely align with the true speeds, within a deviation of 7.5144 m/s. Furthermore, in the stability test, the model exhibits robustness by continuously predicting a time series ten times longer than the input sequence, delivering smooth trajectories and showcasing diverse driving behaviors. The pre-trained model also provides a good basis for downstream fine-tuning tasks. The number of parameters of our model is over 50 million.
https://arxiv.org/abs/2309.12677
Motivated by the success of transformers in various fields, such as language understanding and image analysis, this investigation explores their application in the context of the game of Go. In particular, our study focuses on the analysis of the Vision Transformer. Through a detailed analysis of numerous points such as prediction accuracy, win rates, memory, speed, size, or even learning rate, we have been able to highlight the substantial role that transformers can play in the game of Go. This study was carried out by comparing them to the usual Residual Networks.
https://arxiv.org/abs/2309.12675
AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the DGM^4 dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.
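As an illustration of the IMQ idea, learnable queries can aggregate global contextual cues from one modality's tokens via attention; the sketch below is a generic instantiation with hypothetical dimensions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class LearnableQueryPool(nn.Module):
    """Learnable queries attend over one modality's tokens to aggregate
    global contextual cues (an IMQ-like mechanism)."""
    def __init__(self, num_queries: int = 8, dim: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) from one modality's encoder
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out  # (batch, num_queries, dim) aggregated cues
```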
https://arxiv.org/abs/2309.12657
This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness the knowledge from the target domain and results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65% and 62% relative improvements on the development and evaluation sets compared to the organizer-provided VC-based baseline diarization system, securing third place in diarization performance.
https://arxiv.org/abs/2309.12656
One of the problems in quantitative finance that has received the most attention is the portfolio optimization problem. This problem has been approached using a variety of techniques, with those related to quantum computing being especially prolific in recent years. In this study, we present a system called Quantum Computing-based System for Portfolio Optimization with Future Asset Values and Automatic Universe Reduction (Q4FuturePOP), which deals with the portfolio optimization problem with the following innovations: i) the tool is designed to work with predicted future asset values, instead of historical ones; and ii) Q4FuturePOP includes an automatic universe reduction module, which is conceived to intelligently reduce the complexity of the problem. We also briefly discuss the preliminary performance of the different modules that compose the prototype version of Q4FuturePOP.
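To make the optimization target concrete, here is a classical brute-force sketch of the kind of binary selection problem a quantum solver would typically encode as a QUBO: trade off predicted return against risk under a cardinality constraint. All numbers are toy placeholders; Q4FuturePOP's actual formulation, prediction module, and universe reduction are not reproduced here.

```python
import itertools
import numpy as np

# toy inputs: predicted returns and a diagonal covariance for 6 assets
mu = np.array([0.12, 0.10, 0.07, 0.03, 0.15, 0.09])
sigma = np.diag([0.04, 0.03, 0.02, 0.01, 0.06, 0.03])
risk_aversion, budget = 0.5, 3   # select exactly 3 assets

best = None
for bits in itertools.product([0, 1], repeat=len(mu)):
    x = np.array(bits)
    if x.sum() != budget:
        continue
    cost = risk_aversion * x @ sigma @ x - mu @ x   # risk minus return
    if best is None or cost < best[0]:
        best = (cost, x)
print("selection:", best[1])
```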
https://arxiv.org/abs/2309.12627
In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) plays a key role but its current assignment process is time-consuming. We introduce DRG-LLaMA, a large language model (LLM) fine-tuned on clinical notes for improved DRG prediction. Using Meta's LLaMA as the base model, we optimized it with Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries. With an input token length of 512, DRG-LLaMA-7B achieved a macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0% and a macro-averaged Area Under the Curve (AUC) of 0.986. Impressively, DRG-LLaMA-7B surpassed previously reported leading models on this task, demonstrating a relative improvement in macro-averaged F1 score of 40.3% compared to ClinicalBERT and 35.7% compared to CAML. When DRG-LLaMA is applied to predict base DRGs and complication or comorbidity (CC) / major complication or comorbidity (MCC), the top-1 prediction accuracy reached 67.8% for base DRGs and 67.5% for CC/MCC status. DRG-LLaMA's performance improves with larger model sizes and longer input context lengths. Furthermore, usage of LoRA enables training even on smaller GPUs with 48 GB of VRAM, highlighting the viability of adapting LLMs for DRG prediction.
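For reference, a LoRA setup of this flavor with the peft library might look as follows; the checkpoint name and the 738-class label space are placeholders rather than the paper's exact configuration.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# hypothetical setup: a LLaMA backbone with a classification head over DRG codes
base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf", num_labels=738)

lora = LoraConfig(task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the weights
```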
https://arxiv.org/abs/2309.12625
Surface electromyography (sEMG) and high-density sEMG (HD-sEMG) biosignals have been extensively investigated for myoelectric control of prosthetic devices, neurorobotics, and more recently human-computer interfaces because of their capability for hand gesture recognition/prediction in a wearable and non-invasive manner. High intraday (same-day) performance has been reported. However, the interday performance (separating training and testing days) is substantially degraded due to the poor generalizability of conventional approaches over time, hindering the application of such techniques in real-life practices. There are limited recent studies on the feasibility of multi-day hand gesture recognition. The existing studies face a major challenge: the need for long sEMG epochs makes the corresponding neural interfaces impractical due to the induced delay in myoelectric control. This paper proposes a compact ViT-based network for multi-day dynamic hand gesture prediction. We tackle the main challenge as the proposed model only relies on very short HD-sEMG signal windows (i.e., 50 ms, accounting for only one-sixth of the convention for real-time myoelectric implementation), boosting agility and responsiveness. Our proposed model can predict 11 dynamic gestures for 20 subjects with an average accuracy of over 71% on the testing day, 3-25 days after training. Moreover, when calibrated on just a small portion of data from the testing day, the proposed model can achieve over 92% accuracy by retraining less than 10% of the parameters for computational efficiency.
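As a shape-level sketch of how a 50 ms HD-sEMG window could be tokenized for a compact ViT: the electrode-grid size and sampling rate below are assumptions, and the paper's array geometry and patching may differ.

```python
import torch
import torch.nn as nn

# hypothetical geometry: 64-channel HD-sEMG grid, 50 ms at 2048 Hz ~ 102 samples
window = torch.randn(8, 1, 64, 102)        # (batch, 1, channels, time)

# non-overlapping patches over the channel and time axes
patch_embed = nn.Conv2d(1, 128, kernel_size=(8, 17), stride=(8, 17))
tokens = patch_embed(window).flatten(2).transpose(1, 2)
print(tokens.shape)  # (8, 48, 128): feeds a small Transformer + gesture head
```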
https://arxiv.org/abs/2309.12602