Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically holding the critical information. While humans adeptly classify images from limited exposure, artificial neural networks often struggle to select informative features from rare samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach splits support and query images into patches and encodes them with a pre-trained Vision Transformer (ViT) to obtain a class embedding and patch embeddings for each image. We then filter the patch embeddings using the class embedding so that only class-relevant ones are retained: for each image, we compute the similarity between the class embedding and every patch embedding, sort the similarities in descending order, and keep only the top-ranked patch embeddings. The selected patch embeddings are fused with the class embedding to form a comprehensive image representation, enhancing pattern recognition across instances. This strategy effectively mitigates the impact of class-irrelevant patch embeddings and yields improved performance with pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of our approach, which outperforms state-of-the-art baselines under both 5-shot and 1-shot settings.
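A minimal sketch of the selection-and-fusion step described above, assuming PyTorch tensors for the ViT outputs and cosine similarity as the ranking score; the choice of k and the mean-pooling fusion are illustrative assumptions, not the paper's exact operators.

```python
import torch
import torch.nn.functional as F

def select_and_fuse(cls_emb: torch.Tensor, patch_embs: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Keep the k patch embeddings most similar to the class embedding and
    fuse them with it. Shapes: cls_emb (D,), patch_embs (N, D)."""
    # Cosine similarity between the class embedding and every patch embedding.
    sims = F.cosine_similarity(patch_embs, cls_emb.unsqueeze(0), dim=-1)      # (N,)
    # Sort in descending order and keep the top-k class-relevant patches.
    topk = sims.topk(k=min(k, patch_embs.size(0))).indices
    selected = patch_embs[topk]                                               # (k, D)
    # Fuse selected patches with the class embedding (mean pooling is an assumption).
    return torch.cat([cls_emb.unsqueeze(0), selected], dim=0).mean(dim=0)    # (D,)

# Toy usage with random ViT-like outputs (class token plus 196 patch tokens, 384-dim).
cls_emb, patch_embs = torch.randn(384), torch.randn(196, 384)
image_repr = select_and_fuse(cls_emb, patch_embs, k=16)
print(image_repr.shape)  # torch.Size([384])
```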
https://arxiv.org/abs/2405.03722
This paper presents GeoContrastNet, a language-agnostic framework for structured document understanding (DU) that integrates a contrastive learning objective with graph attention networks (GATs), emphasizing the significant role of geometric features. We propose a novel methodology that combines geometric edge features with visual features within a two-stage GAT-based framework, demonstrating promising results in both link prediction and semantic entity recognition. Our findings reveal that combining geometric and visual features can match, in accuracy and efficiency, the capabilities of large DU models that rely heavily on Optical Character Recognition (OCR) features. This approach underscores the critical importance of relational layout information between the named text entities in the semi-structured layout of a page. Specifically, our results highlight the model's proficiency in identifying key-value relationships within the FUNSD dataset of forms and in discovering the spatial relationships in table-structured layouts of RVLCDIP business invoices. Our code and pretrained models will be accessible on our official GitHub.
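As an illustration of what geometric edge features between two document entities can look like, the hedged sketch below derives relative-position features from bounding boxes; the specific feature set (normalized offsets, distance, angle, size ratios) is an assumption rather than GeoContrastNet's actual definition.

```python
import math

def geometric_edge_features(box_a, box_b, page_w, page_h):
    """Relative geometric features between two entity bounding boxes
    (x0, y0, x1, y1), normalized by page size. Feature choice is illustrative."""
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = (cxb - cxa) / page_w, (cyb - cya) / page_h
    dist = math.hypot(dx, dy)                      # normalized center distance
    angle = math.atan2(dy, dx) / math.pi           # direction in [-1, 1]
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return [dx, dy, dist, angle, wb / max(wa, 1e-6), hb / max(ha, 1e-6)]

# Example: a label box and the value box to its right on an A4-sized page (in points).
print(geometric_edge_features((50, 100, 150, 120), (160, 100, 300, 120), 595, 842))
```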
https://arxiv.org/abs/2405.03104
We present SketchGPT, a flexible framework that employs a sequence-to-sequence autoregressive model for sketch generation and completion, together with an interpretation case study for sketch recognition. By mapping complex sketches into simplified sequences of abstract primitives, our approach significantly streamlines the input for autoregressive modeling. SketchGPT leverages a next-token prediction objective to learn sketch patterns, facilitating the creation and completion of drawings as well as their accurate categorization. This sketch representation strategy helps overcome existing challenges of autoregressive modeling for continuous stroke data, enabling smoother model training and competitive performance. Our findings demonstrate SketchGPT's capability to generate a diverse variety of drawings through qualitative and quantitative comparisons with existing state-of-the-art methods, along with a comprehensive human evaluation study. The code and pretrained models will be released on our official GitHub.
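The sketch below illustrates the general idea of mapping strokes to a short vocabulary of abstract primitives suitable for next-token prediction; the eight direction primitives and pen-state tokens are illustrative assumptions, not SketchGPT's actual primitive set.

```python
import math

# Illustrative primitive vocabulary: 8 movement directions plus pen controls.
DIRS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]
VOCAB = DIRS + ["PEN_UP", "PEN_DOWN", "EOS"]

def strokes_to_tokens(strokes):
    """Map a sketch given as a list of strokes (each a list of (x, y) points)
    to a sequence of abstract primitive tokens."""
    tokens = []
    for stroke in strokes:
        tokens.append("PEN_DOWN")
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            ang = math.atan2(y1 - y0, x1 - x0)          # segment direction
            idx = int(round(ang / (math.pi / 4))) % 8    # quantize into 8 bins
            tokens.append(DIRS[idx])
        tokens.append("PEN_UP")
    tokens.append("EOS")
    return tokens

# A small square drawn as one stroke; an autoregressive model would be trained to
# predict each next token from the preceding ones (standard next-token objective).
square = [[(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(strokes_to_tokens(square))
```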
https://arxiv.org/abs/2405.03099
This project investigates a human multi-modal behavior identification algorithm utilizing deep neural networks. According to the characteristics of different modal information, different deep neural networks are used to adapt to the different modalities of video information. Through the integration of these deep neural networks, the algorithm successfully identifies behaviors across multiple modalities. In this project, multiple Microsoft Kinect cameras were used to collect corresponding skeleton point data in addition to conventional images, so that the motion features in the image can be extracted. Ultimately, the behavioral characteristics discerned through both approaches are synthesized to facilitate the precise identification and categorization of behaviors. The performance of the suggested algorithm was evaluated using the MSR3D dataset. The findings from these experiments indicate that the accuracy in recognizing behaviors remains consistently high, suggesting that the algorithm is reliable in various scenarios. Additionally, the tests demonstrate that the algorithm substantially enhances the accuracy of detecting pedestrian behaviors in video footage.
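A minimal sketch of decision-level fusion of the two modalities, assuming an RGB-stream classifier and a skeleton-stream classifier that each output class probabilities; the weighting scheme is an assumption.

```python
import numpy as np

def fuse_predictions(rgb_probs: np.ndarray, skeleton_probs: np.ndarray, w_rgb: float = 0.5) -> int:
    """Weighted late fusion of per-class probabilities from the two modalities.
    Both inputs are 1-D arrays over the same behavior classes."""
    fused = w_rgb * rgb_probs + (1.0 - w_rgb) * skeleton_probs
    return int(np.argmax(fused))  # index of the predicted behavior class

# Toy example with three behavior classes: the modalities disagree,
# and fusion resolves the prediction.
rgb = np.array([0.2, 0.5, 0.3])
skel = np.array([0.6, 0.1, 0.3])
print(fuse_predictions(rgb, skel, w_rgb=0.4))  # 0
```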
https://arxiv.org/abs/2405.03091
Colorectal cancer contributes significantly to cancer-related mortality, and timely identification and removal of polyps through colonoscopy screening is crucial to decreasing mortality rates. Accurately detecting polyps in colonoscopy images is difficult because of variations in characteristics such as size, shape, and texture and the similarity of polyps to surrounding tissue. Current deep-learning methods often struggle to capture the long-range dependencies necessary for segmentation. This research presents BetterNet, a convolutional neural network (CNN) architecture that combines residual learning and attention mechanisms to enhance the accuracy of polyp segmentation. Its primary characteristics are: (1) a residual decoder architecture that facilitates efficient gradient propagation and integration of multiscale features; (2) channel and spatial attention blocks within the decoder that focus learning on relevant polyp regions; (3) state-of-the-art performance on polyp segmentation benchmarks while ensuring computational efficiency; (4) thorough ablation studies confirming the influence of each architectural component; and (5) openly released model code. Extensive evaluations on the Kvasir-SEG, CVC ClinicDB, Endoscene, EndoTect, and Kvasir-Sessile datasets demonstrate that BetterNet outperforms current SOTA models in segmentation accuracy by significant margins. Its lightweight design enables real-time inference for various applications. BetterNet shows promise for integration into computer-assisted diagnosis to enhance polyp detection and early cancer recognition. Link to the code: this https URL
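The following is a compact PyTorch sketch of channel and spatial attention blocks of the kind described for the decoder, in the spirit of CBAM; the reduction ratio and kernel size are assumptions rather than BetterNet's exact configuration.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel attention (squeeze-and-excite style) followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                      # re-weight channels
        avg_map = x.mean(dim=1, keepdim=True)            # per-pixel channel average
        max_map = x.max(dim=1, keepdim=True).values      # per-pixel channel maximum
        attn = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        return x * attn                                  # re-weight spatial locations

block = ChannelSpatialAttention(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```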
https://arxiv.org/abs/2405.04288
We introduce an advanced, swift pattern recognition strategy for multiple types of robots during curve negotiation. This method, leveraging a k-means-clustering-enhanced Support Vector Machine (SVM) algorithm, distinctly categorizes robots into flying or mobile robots. Initially, the paradigm considers robot locations and features as quintessential parameters indicative of divergent robot patterns. Subsequently, the k-means clustering technique facilitates the efficient segregation and consolidation of robotic data, significantly optimizing the support vector delineation process and expediting the recognition phase. Following this preparatory phase, the SVM methodology is applied to construct a discriminative hyperplane, enabling precise classification and prediction of the robot category. To substantiate the efficacy and superiority of the k-means-enhanced framework over traditional SVM approaches, a rigorous cross-validation experiment was conducted, evidencing the former's enhanced performance in robot group classification.
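A small scikit-learn sketch of the described pipeline on synthetic data: each class is first compressed with k-means, and the SVM hyperplane is then fit on the cluster centroids; the features and the centroid-based reduction rule are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic robot features (e.g., position/motion descriptors): class 0 = mobile, 1 = flying.
X = np.vstack([rng.normal(0.0, 1.0, (500, 4)), rng.normal(3.0, 1.0, (500, 4))])
y = np.array([0] * 500 + [1] * 500)

# Step 1: compress each class with k-means; the centroids summarize the data and
# shrink the problem the SVM has to solve (an illustrative reduction).
centroids, centroid_labels = [], []
for cls in (0, 1):
    km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X[y == cls])
    centroids.append(km.cluster_centers_)
    centroid_labels.append(np.full(20, cls))
X_red, y_red = np.vstack(centroids), np.concatenate(centroid_labels)

# Step 2: fit the SVM hyperplane on the reduced set and classify robots.
svm = SVC(kernel="rbf").fit(X_red, y_red)
print(svm.score(X, y))  # accuracy on the full synthetic set
```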
https://arxiv.org/abs/2405.03026
Training data memorization in language models impacts model capability (generalization) and safety (privacy risk). This paper focuses on analyzing prompts' impact on detecting the memorization of 6 masked language model-based named entity recognition models. Specifically, we employ a diverse set of 400 automatically generated prompts, and a pairwise dataset where each pair consists of one person's name from the training set and another name out of the set. A prompt completed with a person's name serves as input for getting the model's confidence in predicting this name. Finally, the prompt performance of detecting model memorization is quantified by the percentage of name pairs for which the model has higher confidence for the name from the training set. We show that the performance of different prompts varies by as much as 16 percentage points on the same model, and prompt engineering further increases the gap. Moreover, our experiments demonstrate that prompt performance is model-dependent but does generalize across different name sets. A comprehensive analysis indicates how prompt performance is influenced by prompt properties, contained tokens, and the model's self-attention weights on the prompt.
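A hedged sketch of the pairwise scoring step, assuming a Hugging Face masked language model and single-token names for simplicity; the backbone, prompt, and names below are placeholders. Aggregating this comparison over all name pairs gives the percentage used as the prompt's memorization-detection performance.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-cased"  # stand-in for the NER models' MLM backbone
tok = AutoTokenizer.from_pretrained(model_name)
mlm = AutoModelForMaskedLM.from_pretrained(model_name).eval()

def name_confidence(prompt: str, name: str) -> float:
    """Probability the MLM assigns to `name` at the [MASK] position of `prompt`.
    Assumes the name is a single token in the model's vocabulary."""
    inputs = tok(prompt.format(tok.mask_token), return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]
    return logits.softmax(-1)[tok.convert_tokens_to_ids(name)].item()

# One pair: a name from the training set vs. a name outside it (placeholders).
prompt = "{} filed the report yesterday."
in_train, out_of_train = "John", "Maria"
print(name_confidence(prompt, in_train) > name_confidence(prompt, out_of_train))
```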
https://arxiv.org/abs/2405.03004
As interest in large language models (LLMs) grows, the importance of accuracy in automatic speech recognition (ASR) has become more pronounced. This is particularly true for lectures that include specialized terminology, where the success rate of traditional ASR models tends to be low, posing a challenging problem. A method has been proposed to improve ASR performance for specialized terminology using the word frequency difference approach. Through experiments and data analysis, we investigate whether this proposal effectively addresses the issue. Additionally, we introduce the power law as the theoretical foundation for relative word frequency.
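A minimal sketch of one way a word-frequency-difference score can be computed, comparing relative term frequencies in lecture transcripts against a general corpus; the scoring rule and toy corpora are assumptions, not the paper's exact formulation.

```python
from collections import Counter

def relative_freqs(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def frequency_difference(domain_tokens, general_tokens):
    """Terms whose relative frequency in the domain corpus exceeds that in the
    general corpus; large positive differences flag specialized terminology."""
    dom, gen = relative_freqs(domain_tokens), relative_freqs(general_tokens)
    return sorted(((w, dom[w] - gen.get(w, 0.0)) for w in dom),
                  key=lambda kv: kv[1], reverse=True)

# Toy corpora: a lecture snippet vs. everyday text.
lecture = "the eigenvalue of the covariance matrix determines the eigenvalue spread".split()
general = "the cat sat on the mat and the dog slept".split()
print(frequency_difference(lecture, general)[:3])
```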
https://arxiv.org/abs/2405.02995
Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting greater attention from the research community. With respect to other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the presence of a significant variety of real fight scenes. Unfortunately, available datasets are very small compared with other action recognition datasets. Moreover, in surveillance applications, the people in the scenes differ from video to video and the background of the footage differs from camera to camera. Also, violent actions in real-life surveillance videos must be detected quickly to prevent unwanted consequences, so models clearly benefit from reduced memory usage and computational costs. Such problems make classical action recognition methods difficult to adopt. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at this https URL.
https://arxiv.org/abs/2405.02961
Source-free domain adaptation (SFDA) aims to adapt a source model trained on a fully-labeled source domain to a related but unlabeled target domain. While the source model is a key avenue for acquiring target pseudolabels, the generated pseudolabels may exhibit source bias. In the conventional SFDA pipeline, a feature extractor pre-trained on large data (e.g., ImageNet) is used to initialize the source model at the start of source training and is subsequently discarded. Despite having diverse features important for generalization, the pre-trained feature extractor can overfit to the source data distribution during source training and forget relevant target domain knowledge. Rather than discarding this valuable knowledge, we introduce an integrated framework to incorporate pre-trained networks into the target adaptation process. The proposed framework is flexible and allows us to plug modern pre-trained networks into the adaptation process to leverage their stronger representation learning capabilities. For adaptation, we propose the Co-learn algorithm to improve target pseudolabel quality collaboratively through the source model and a pre-trained feature extractor. Building on the recent success of the vision-language model CLIP in zero-shot image recognition, we present an extension, Co-learn++, that further incorporates CLIP's zero-shot classification decisions. We evaluate on 3 benchmark datasets and include more challenging scenarios such as open-set, partial-set, and open-partial SFDA. Experimental results demonstrate that our proposed strategy improves adaptation performance and can be successfully integrated with existing SFDA methods.
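A hedged sketch of collaborative pseudo-labelling in the spirit of Co-learn: the pre-trained feature extractor is turned into a classifier via nearest class centroids, and a pseudo-label is kept only when it agrees with the source model. The agreement rule and toy data are assumptions; the actual algorithm's selection and weighting may differ.

```python
import numpy as np

def nearest_centroid_labels(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each target feature to the closest class centroid (cosine similarity)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return (f @ c.T).argmax(axis=1)

def colearn_pseudolabels(source_probs, pretrained_feats, class_centroids):
    """Keep a pseudo-label only when the source model and the pre-trained
    feature extractor's nearest-centroid prediction agree (illustrative rule)."""
    source_pred = source_probs.argmax(axis=1)
    feat_pred = nearest_centroid_labels(pretrained_feats, class_centroids)
    keep = source_pred == feat_pred
    return source_pred, keep  # labels plus a mask of confident target samples

# Toy target batch: 4 samples, 3 classes, 8-dim pre-trained features.
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(3), size=4)
feats, cents = rng.normal(size=(4, 8)), rng.normal(size=(3, 8))
labels, keep = colearn_pseudolabels(probs, feats, cents)
print(labels, keep)
```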
https://arxiv.org/abs/2405.02954
Artificial neural networks trained on large, expert-labelled datasets are considered state-of-the-art for a range of medical image recognition tasks. However, categorically labelled datasets are time-consuming to generate and constrain classification to a pre-defined, fixed set of classes. For neuroradiological applications in particular, this represents a barrier to clinical adoption. To address these challenges, we present a self-supervised text-vision framework that learns to detect clinically relevant abnormalities in brain MRI scans by directly leveraging the rich information contained in accompanying free-text neuroradiology reports. Our training approach consisted of two steps. First, a dedicated neuroradiological language model - NeuroBERT - was trained to generate fixed-dimensional vector representations of neuroradiology reports (N = 50,523) via domain-specific self-supervised learning tasks. Next, convolutional neural networks (one per MRI sequence) learnt to map individual brain scans to their corresponding text vector representations by optimising a mean square error loss. Once trained, our text-vision framework can be used to detect abnormalities in unreported brain MRI examinations by scoring scans against suitable query sentences (e.g., 'there is an acute stroke', 'there is hydrocephalus', etc.), enabling a range of classification-based applications including automated triage. Potentially, our framework could also serve as a clinical decision support tool, not only by suggesting findings to radiologists and detecting errors in provisional reports, but also by retrieving and displaying examples of pathologies from historical examinations that could be relevant to the current case based on textual descriptors.
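A minimal sketch of the scoring step, assuming the trained image model maps a scan into the report-embedding space and that abnormality detection reduces to a similarity between that vector and the NeuroBERT embedding of a query sentence; the vectors and threshold below are placeholders.

```python
import torch
import torch.nn.functional as F

def abnormality_score(scan_vec: torch.Tensor, query_vec: torch.Tensor) -> float:
    """Similarity between a scan's predicted text-space vector and the embedding
    of a query sentence such as 'there is an acute stroke'."""
    return F.cosine_similarity(scan_vec, query_vec, dim=0).item()

def triage(scan_vec, query_vecs, threshold=0.3):
    """Flag a scan if any clinically relevant query sentence scores above threshold."""
    return any(abnormality_score(scan_vec, q) > threshold for q in query_vecs)

# Placeholders standing in for CNN(scan) and NeuroBERT(sentence) outputs (both 768-dim).
scan_vec = torch.randn(768)
queries = {"there is an acute stroke": torch.randn(768),
           "there is hydrocephalus": torch.randn(768)}
print(triage(scan_vec, queries.values()))
```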
https://arxiv.org/abs/2405.02782
In this study, we address one of the challenges of developing NER models for scholarly domains, namely the scarcity of suitable labeled data. We experiment with an approach that uses predictions from a fine-tuned LLM to aid non-domain experts in annotating scientific entities within astronomy literature, with the goal of uncovering whether such a collaborative process can approximate domain expertise. Our results reveal moderate agreement between a domain expert and the LLM-assisted non-experts, as well as fair agreement between the domain expert and the LLM's predictions. In an additional experiment, we compare the performance of fine-tuned and default LLMs on this task. We also introduce a specialized scientific entity annotation scheme for astronomy, validated by a domain expert. Our approach adopts a scholarly research contribution-centric perspective, focusing exclusively on scientific entities relevant to the research theme. The resulting dataset, containing 5,000 annotated astronomy article titles, is made publicly available.
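Agreement levels such as "moderate" and "fair" are conventionally quantified with Cohen's kappa; the hedged sketch below shows the computation on placeholder labels (not data from the study).

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder entity labels assigned to the same ten mentions by a domain expert
# and by an LLM-assisted non-expert.
expert     = ["Object", "Object", "Method", "Survey", "Object",
              "Method", "Object", "Survey", "Method", "Object"]
non_expert = ["Object", "Method", "Method", "Survey", "Object",
              "Object", "Object", "Survey", "Method", "Survey"]

kappa = cohen_kappa_score(expert, non_expert)
print(f"Cohen's kappa = {kappa:.2f}")  # values around 0.4-0.6 are read as 'moderate' agreement
```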
https://arxiv.org/abs/2405.02602
This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. Mixat was developed to address the shortcomings of current speech recognition resources when applied to Emirati speech, and in particular to bilingual Emirati speakers, who often mix and switch between their local dialect and English. The dataset consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers, one of which takes the form of conversations between the host and a guest. The collection therefore contains examples of Emirati-English code-switching in both formal and natural conversational contexts. In this paper, we describe the process of data collection and annotation, and describe some of the features and statistics of the resulting dataset. In addition, we evaluate the performance of pre-trained Arabic and multilingual ASR systems on our dataset, demonstrating the shortcomings of existing models on this low-resource dialectal Arabic and the additional challenge of recognizing code-switching in ASR. The dataset will be made publicly available for research use.
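A minimal sketch of the kind of evaluation described, scoring a pretrained ASR system's transcripts against Mixat references with word error rate via the jiwer package; the reference/hypothesis pairs below are invented placeholders illustrating code-switched utterances.

```python
from jiwer import wer

# Placeholder reference/hypothesis pairs; in practice these would come from the
# Mixat annotations and the output of a pretrained Arabic or multilingual ASR model.
references = [
    "so we booked the venue قبل شهرين",
    "and then قلت له خلاص we will meet tomorrow",
]
hypotheses = [
    "so we book the venue قبل شهر",
    "and then قلت له خلاص we will meet tomorrow",
]

print(f"WER = {wer(references, hypotheses):.2%}")
```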
https://arxiv.org/abs/2405.02578
Human object recognition exhibits remarkable resilience in cluttered and dynamic visual environments. In contrast, despite their unparalleled performance across numerous visual tasks, Deep Neural Networks (DNNs) remain far less robust than humans, showing, for example, a surprising susceptibility to adversarial attacks involving image perturbations that are (almost) imperceptible to humans. Human object recognition likely owes its robustness, in part, to the increasingly resilient representations that emerge along the hierarchy of the ventral visual cortex. Here we show that DNNs, when guided by neural representations from a hierarchical sequence of regions in the human ventral visual stream, display increasing robustness to adversarial attacks. These neural-guided models also exhibit a gradual shift towards more human-like decision-making patterns and develop hierarchically smoother decision surfaces. Importantly, the resulting representational spaces differ in important ways from those produced by conventional smoothing methods, suggesting that such neural-guidance may provide previously unexplored robustness solutions. Our findings support the gradual emergence of human robustness along the ventral visual hierarchy and suggest that the key to DNN robustness may lie in increasing emulation of the human brain.
https://arxiv.org/abs/2405.02564
Panoramic Activity Recognition (PAR) aims to identify multi-granularity behaviors performed by multiple persons in panoramic scenes, including individual activities, group activities, and global activities. Previous methods 1) rely heavily on manually annotated detection boxes in training and inference, hindering practical deployment; or 2) directly employ standard detectors to detect multiple persons with varying sizes and spatial occlusion in panoramic scenes, limiting the performance gains of PAR. To this end, we consider learning a detector adapted to varying-size, occluded persons, which is optimized along with the recognition module in an all-in-one framework. We therefore propose a novel Adapt-Focused bi-Propagating Prototype learning (AdaFPP) framework that jointly recognizes individual, group, and global activities in panoramic activity scenes by learning an adapt-focused detector and multi-granularity prototypes as pretext tasks in an end-to-end way. Specifically, to accommodate the varying sizes and spatial occlusion of multiple persons in crowded panoramic scenes, we introduce a panoramic adapt-focuser, which achieves size-adapting detection of individuals by comprehensively selecting and performing fine-grained detections on object-dense sub-regions identified through the original detections. In addition, to mitigate information loss due to inaccurate individual localization, we introduce a bi-propagating prototyper that promotes closed-loop interaction and informative consistency across granularities by facilitating bidirectional information propagation among the individual, group, and global levels. Extensive experiments demonstrate the significant performance of AdaFPP and emphasize its powerful applicability for PAR.
https://arxiv.org/abs/2405.02538
Knowledge graphs have emerged as a sophisticated advancement and refinement of semantic networks, and their deployment is one of the critical methodologies in contemporary artificial intelligence. The construction of knowledge graphs is a multifaceted process involving various techniques, and researchers aim to extract knowledge from existing resources for construction, since building from scratch entails significant labor and time costs. However, due to the pervasive issue of heterogeneity, the diversity of descriptions across different knowledge graphs can lead to mismatches between concepts, thereby impacting the efficacy of knowledge extraction. This Ph.D. study focuses on automatic knowledge graph extension, i.e., properly extending a reference knowledge graph by extracting and integrating concepts from one or more candidate knowledge graphs. We propose a novel knowledge graph extension framework based on entity type recognition. The framework aims to achieve high-quality knowledge extraction by aligning the schemas and entities across different knowledge graphs, thereby enhancing the performance of the extension. This paper elucidates three major contributions: (i) we propose an entity type recognition method exploiting machine learning and property-based similarities to enhance knowledge extraction; (ii) we introduce a set of assessment metrics to validate the quality of the extended knowledge graphs; (iii) we develop a platform for knowledge graph acquisition, management, and extension to benefit knowledge engineers in practice. Our evaluation comprehensively demonstrates the feasibility and effectiveness of the proposed extension framework and its functionalities through quantitative experiments and case studies.
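A hedged sketch of property-based similarity for entity type recognition: an entity's observed properties are matched against each candidate type's property set with Jaccard overlap. The toy schemas and the use of plain Jaccard are illustrative assumptions, not the thesis's actual metric.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def recognize_type(entity_props: set, type_schemas: dict) -> str:
    """Pick the candidate type whose property set best matches the entity's
    observed properties (property-based similarity)."""
    return max(type_schemas, key=lambda t: jaccard(entity_props, type_schemas[t]))

# Toy schemas from a reference knowledge graph and one entity from a candidate graph.
schemas = {
    "Person":       {"birthDate", "birthPlace", "name", "spouse"},
    "Organization": {"foundingDate", "headquarters", "name", "ceo"},
    "Place":        {"latitude", "longitude", "name", "population"},
}
entity = {"name", "foundingDate", "ceo"}
print(recognize_type(entity, schemas))  # Organization
```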
https://arxiv.org/abs/2405.02463
As the burden of herbicide resistance grows and the environmental repercussions of excessive herbicide use become clear, new ways of managing weed populations are needed. This is particularly true for cereal crops, like wheat and barley, that are staple food crops and occupy a globally significant portion of agricultural land. Even small improvements in weed management practices across these major food crops worldwide would yield considerable benefits for both the environment and global food security. Blackgrass is a major grass weed which causes particular problems in cereal crops in north-west Europe, a major cereal production area, because it has high levels of herbicide resistance and is well adapted to agronomic practice in this region. With the use of machine vision and multispectral imaging, we investigate the effectiveness of state-of-the-art methods to identify blackgrass in wheat and barley crops. As part of this work, we provide a large dataset with which we evaluate several key aspects of blackgrass weed recognition. Firstly, we determine the performance of different CNN and transformer-based architectures on images from unseen fields. Secondly, we demonstrate the role that different spectral bands have on the performance of weed classification. Lastly, we evaluate the role of dataset size in classification performance for each of the models trialled. We find that even with a fairly modest quantity of training data an accuracy of almost 90% can be achieved on images from unseen fields.
https://arxiv.org/abs/2405.02218
Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which more and more accurate synthesis methods are developed, it is very important to design techniques that also work well on data they were not trained on. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and show strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.
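A minimal sketch of the verification-style decision: the test utterance's speaker embedding is compared with embeddings of a few genuine voice fragments of the claimed identity, and low similarity exposes the audio as fake. The random embeddings and threshold are placeholders for a pre-trained speaker model's outputs.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_fake(test_emb: np.ndarray, enrollment_embs: list, threshold: float = 0.5) -> bool:
    """Declare the test audio fake if its speaker embedding does not match the
    enrolled voice fragments of the claimed identity."""
    score = max(cosine(test_emb, ref) for ref in enrollment_embs)
    return score < threshold

# Placeholders standing in for embeddings from a large pre-trained speaker model.
rng = np.random.default_rng(0)
identity_voice = rng.normal(size=256)
enrollment = [identity_voice + 0.1 * rng.normal(size=256) for _ in range(3)]
genuine_test = identity_voice + 0.1 * rng.normal(size=256)
spoofed_test = rng.normal(size=256)  # synthetic voice, unrelated to the identity
print(is_fake(genuine_test, enrollment), is_fake(spoofed_test, enrollment))  # False True
```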
https://arxiv.org/abs/2405.02179
The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still potential for enhancement in the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage both the obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on IEMOCAP show that our GMP-ATL achieves superior recognition performance, with a WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER methods, while also yielding comparable results with multimodal SER approaches.
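A hedged sketch of deriving frame-level pseudo-labels by clustering frame features at several granularities, in the spirit of the multi-scale k-means step; the random features stand in for HuBERT frame representations, and the gender augmentation is not modelled here.

```python
import numpy as np
from sklearn.cluster import KMeans

def multiscale_pseudolabels(frame_feats: np.ndarray, scales=(50, 100, 500)) -> dict:
    """Cluster frame-level features at several granularities (cluster counts);
    each scale yields one pseudo-label per frame."""
    return {k: KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(frame_feats)
            for k in scales}

# Placeholder for HuBERT frame features of one batch: 2000 frames, 768 dims.
feats = np.random.default_rng(0).normal(size=(2000, 768)).astype(np.float32)
labels = multiscale_pseudolabels(feats, scales=(10, 50))
print({k: v[:8] for k, v in labels.items()})  # first frames' pseudo-labels per scale
```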
https://arxiv.org/abs/2405.02151
Large Language Models have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, TestNet, and TestMeeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pretrained models and training logs, to promote reproducible research.
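A hedged sketch of a typical projector module in this paradigm: speech-encoder frames are downsampled by stacking adjacent frames and projected into the LLM's embedding space; the dimensions, stacking factor, and two-layer design are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Maps speech-encoder outputs to LLM input embeddings."""
    def __init__(self, enc_dim=1280, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        t = t - t % self.stack                                      # drop trailing frames
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)    # frame stacking = downsampling
        return self.proj(x)                                         # (B, T // stack, llm_dim)

# Toy speech-encoder output: batch of 2, 100 frames, 1280-dim features.
print(SpeechProjector()(torch.randn(2, 100, 1280)).shape)  # torch.Size([2, 25, 4096])
```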
https://arxiv.org/abs/2405.02132