Value function factorization methods are commonly used in cooperative multi-agent reinforcement learning, with QMIX receiving significant attention. Many QMIX-based methods introduce monotonicity constraints between the joint action value and individual action values to achieve decentralized execution. However, such constraints limit the representation capacity of value factorization, restricting the joint action values it can represent and hindering the learning of the optimal policy. To address this challenge, we propose the Potentially Optimal joint actions Weighted QMIX (POWQMIX) algorithm, which recognizes the potentially optimal joint actions and assigns higher weights to the corresponding losses of these joint actions during training. We theoretically prove that with such a weighted training approach the optimal policy is guaranteed to be recovered. Experiments in matrix games, predator-prey, and StarCraft II Multi-Agent Challenge environments demonstrate that our algorithm outperforms the state-of-the-art value-based multi-agent reinforcement learning methods.
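The weighted-training rule at the heart of POWQMIX can be sketched as a weighted regression loss. This is a minimal illustration, assuming the potential-optimality flags are already given (in the paper they come from a learned recognizer) and using a placeholder down-weighting factor `w` for all other joint actions:

```python
import numpy as np

def weighted_td_loss(q_tot, targets, is_potentially_optimal, w=0.1):
    """Sketch of a POWQMIX-style weighted loss: squared TD errors for joint
    actions flagged as potentially optimal keep full weight 1.0, while all
    other joint actions are down-weighted by w < 1. The recognition of
    potentially optimal joint actions is itself a learned component in the
    paper; here the flags are simply provided."""
    q_tot = np.asarray(q_tot, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = np.where(np.asarray(is_potentially_optimal, dtype=bool), 1.0, w)
    return float(np.mean(weights * (q_tot - targets) ** 2))

# Two samples with equal TD error: the potentially optimal one contributes
# its full squared error, the other only a tenth of it.
loss = weighted_td_loss([1.0, 1.0], [2.0, 2.0], [True, False], w=0.1)
```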
https://arxiv.org/abs/2405.08036
This study presents a novel methodology utilizing a pre-trained speech recognition model for processing respiratory sound data. By incorporating medical record information, we introduce an innovative multi-modal deep-learning architecture, named Rene, which addresses the challenges of poor interpretability and underperformance in real-time clinical diagnostic response observed in previous respiratory disease-focused models. The proposed Rene architecture demonstrated significant improvements of 10.24%, 16.15%, 15.29%, and 18.90% respectively, compared to the baseline across four tasks related to respiratory event detection and audio record classification on the SPRSound database. In patient disease prediction tests on the ICBHI database, the architecture exhibited improvements of 23% in the mean of average score and harmonic score compared to the baseline. Furthermore, we developed a real-time respiratory sound discrimination system based on the Rene architecture, featuring a dual-thread design and compressed model parameters for simultaneous microphone recording and real-time dynamic decoding. Employing state-of-the-art Edge AI technology, this system enables rapid and accurate responses for respiratory sound auscultation, facilitating deployment on wearable clinical detection devices to capture incremental data, which can be synergistically evolved with large-scale models deployed on cloud servers for downstream tasks.
https://arxiv.org/abs/2405.07442
In visual place recognition, accurately identifying and matching images of locations under varying environmental conditions and viewpoints remains a significant challenge. In this paper, we introduce a new technique, called Bag-of-Queries (BoQ), which learns a set of global queries designed to capture universal place-specific attributes. Unlike existing methods that employ self-attention and generate the queries directly from the input features, BoQ employs distinct learnable global queries, which probe the input features via cross-attention, ensuring consistent information aggregation. In addition, our technique provides an interpretable attention mechanism and integrates with both CNN and Vision Transformer backbones. The performance of BoQ is demonstrated through extensive experiments on 14 large-scale benchmarks. It consistently outperforms current state-of-the-art techniques including NetVLAD, MixVPR and EigenPlaces. Moreover, as a global retrieval technique (one-stage), BoQ surpasses two-stage retrieval methods, such as Patch-NetVLAD, TransVPR and R2Former, all while being orders of magnitude faster and more efficient. The code and model weights are publicly available at this https URL.
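The aggregation idea can be sketched with single-head cross-attention: a fixed bank of global queries (random placeholders here, learnable in practice) probes the input features, and the attended outputs are concatenated into one descriptor. The shapes and single-head formulation are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def bag_of_queries(features, queries):
    """Minimal BoQ-style aggregation sketch: K global queries cross-attend
    over N input features (softmax over features per query), and the K
    attended vectors are flattened into a single global descriptor."""
    d = features.shape[-1]
    scores = queries @ features.T / np.sqrt(d)      # (K, N) attention logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over features
    out = attn @ features                           # (K, D) aggregated vectors
    return out.reshape(-1)                          # flatten to one descriptor

rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 64))              # e.g. a 14x14 patch feature map
global_queries = rng.standard_normal((32, 64))      # K learnable global queries
desc = bag_of_queries(feats, global_queries)
```

Because the queries are independent of the input, the same attributes are probed for every image, which is what gives the consistent aggregation (and the interpretable per-query attention maps) described above.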
https://arxiv.org/abs/2405.07364
The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.
https://arxiv.org/abs/2405.07354
This study introduces a novel Supervised Info-enhanced Contrastive Learning framework for EEG-based Emotion Recognition (SI-CLEER). SI-CLEER employs multi-granularity contrastive learning to create robust EEG contextual representations, potentially improving emotion recognition effectiveness. Unlike existing methods solely guided by classification loss, we propose a joint learning model combining self-supervised contrastive learning loss and supervised classification loss. This model optimizes both loss functions, capturing subtle EEG signal differences specific to emotion detection. Extensive experiments demonstrate SI-CLEER's robustness and superior accuracy on the SEED dataset compared to state-of-the-art methods. Furthermore, we analyze electrode performance, highlighting the significance of central frontal and temporal brain region EEGs in emotion detection. This study offers a universally applicable approach with potential benefits for diverse EEG classification tasks.
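The joint objective can be sketched as a classification loss plus a weighted contrastive term. The contrastive term below is a generic InfoNCE stand-in for the paper's multi-granularity losses, and the trade-off weight `lam` is an assumed hyperparameter:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE between two views (each row a matching embedding pair):
    matching pairs sit on the diagonal of the cosine-similarity matrix."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def joint_loss(ce, z1, z2, lam=0.5):
    """Joint objective sketch: supervised classification loss plus a
    weighted self-supervised contrastive term, optimized together."""
    return ce + lam * info_nce(z1, z2)

z = np.eye(4)  # four perfectly aligned, mutually distinct embeddings
loss = joint_loss(ce=0.7, z1=z, z2=z, lam=0.5)
```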
https://arxiv.org/abs/2405.07260
Despite the rapid advancement in the field of image recognition, the processing of high-resolution imagery remains a computational challenge. However, this processing is pivotal for extracting detailed object insights in areas ranging from autonomous vehicle navigation to medical imaging analyses. Our study introduces a framework aimed at mitigating these challenges by leveraging memory efficient patch based processing for high resolution images. It incorporates a global context representation alongside local patch information, enabling a comprehensive understanding of the image content. In contrast to traditional training methods which are limited by memory constraints, our method enables training of ultra high resolution images. We demonstrate the effectiveness of our method through superior performance on 7 different benchmarks across classification, object detection, and segmentation. Notably, the proposed method achieves strong performance even on resource-constrained devices like Jetson Nano. Our code is available at this https URL.
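The patch-based idea can be sketched as iterating over tiles of a high-resolution image while pairing each tile with a cheap global summary, rather than materializing the full-resolution tensor at once. The thumbnail-mean summary below is an illustrative placeholder, not the paper's learned global representation:

```python
import numpy as np

def split_into_patches(image, patch):
    """Yield non-overlapping tiles so each forward pass only touches a
    patch-sized slice of the high-resolution input (memory-efficient)."""
    h, w = image.shape[:2]
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            yield image[i:i + patch, j:j + patch]

img = np.ones((256, 256))
# Coarse global-context summary computed once from a downsampled view;
# in the described framework this would be a learned representation.
global_context = img[::8, ::8].mean()
patches = list(split_into_patches(img, 64))
```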
https://arxiv.org/abs/2405.07166
Recognizing human actions from point cloud sequence has attracted tremendous attention from both academia and industry due to its wide applications. However, most previous studies on point cloud action recognition typically require complex networks to extract intra-frame spatial features and inter-frame temporal features, resulting in an excessive number of redundant computations. This leads to high latency, rendering them impractical for real-world applications. To address this problem, we propose a Plane-Fit Redundancy Encoding point cloud sequence network named PRENet. The primary concept of our approach involves the utilization of plane fitting to mitigate spatial redundancy within the sequence, concurrently encoding the temporal redundancy of the entire sequence to minimize redundant computations. Specifically, our network comprises two principal modules: a Plane-Fit Embedding module and a Spatio-Temporal Consistency Encoding module. The Plane-Fit Embedding module capitalizes on the observation that successive point cloud frames exhibit unique geometric features in physical space, allowing for the reuse of spatially encoded data for temporal stream encoding. The Spatio-Temporal Consistency Encoding module amalgamates the temporal structure of the temporally redundant part with its corresponding spatial arrangement, thereby enhancing recognition accuracy. We have done numerous experiments to verify the effectiveness of our network. The experimental results demonstrate that our method achieves almost identical recognition accuracy while being nearly four times faster than other state-of-the-art methods.
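The geometric primitive behind the Plane-Fit Embedding can be sketched with a least-squares plane fit via SVD; locally planar regions have near-zero residual and can therefore be encoded compactly and reused across frames. The network's actual embedding is learned; this only illustrates the plane-fitting step:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit: returns (centroid, unit normal), where the
    normal is the direction of least variance from the SVD."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[-1]

def plane_residual(points):
    """Mean absolute point-to-plane distance; near zero for planar patches,
    flagging spatially redundant structure."""
    c, n = fit_plane(points)
    return float(np.mean(np.abs((points - c) @ n)))

rng = np.random.default_rng(0)
# Points scattered on the z = 0 plane: the fit recovers the z-axis normal
# and an essentially zero residual.
flat = np.column_stack([rng.uniform(size=50), rng.uniform(size=50), np.zeros(50)])
```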
https://arxiv.org/abs/2405.06929
This study introduces a novel data augmentation technique, ADLDA, aimed at mitigating the negative impact of data distribution shifts caused by the data augmentation process in computer vision tasks. ADLDA partitions augmented data into distinct subdomains and incorporates domain labels, combined with domain adaptation techniques, to optimize data representation in the model's feature space. Experimental results demonstrate that ADLDA significantly enhances model performance across multiple datasets, particularly in neural network architectures with complex feature extraction layers. Furthermore, ADLDA improves the model's ability to locate and recognize key features, showcasing potential in object recognition and image segmentation tasks. This paper's contribution provides an effective data augmentation regularization method for the field of computer vision, aiding the enhancement of robustness and accuracy in deep learning models.
https://arxiv.org/abs/2405.06893
Deep neural networks (DNNs) have been used to create models for many complex analysis problems like image recognition and medical diagnosis. DNNs are a popular tool within machine learning due to their ability to model complex patterns and distributions. However, the performance of these networks is highly dependent on the quality of the data used to train the models. Two characteristics of training data, noisy labels and training set biases, are known to frequently cause poor generalization performance as a result of overfitting to the training set. This paper aims to solve this problem using the approach proposed by Ren et al. (2018), which combines meta-training with online weight approximation. We first implement a toy problem to crudely verify the claims made by Ren et al. (2018), and then apply the approach to a real-world problem: skin-cancer detection on an imbalanced image dataset.
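The core intuition of the Ren et al. (2018) reweighting can be sketched outside the training loop: a training example gets weight proportional to how well its gradient aligns with the gradient on a small clean validation set, clipped at zero and normalized. The paper computes this online with meta-gradients inside training; this standalone linear-model version only illustrates the weighting rule:

```python
import numpy as np

def example_weights(X_train, y_train, X_val, y_val, theta):
    """Per-example weights for squared loss on a linear model y ~ X @ theta:
    weight ~ max(0, <per-example train gradient, clean validation gradient>),
    so examples whose gradients oppose the clean set (e.g. flipped labels)
    are zeroed out."""
    resid_train = X_train @ theta - y_train
    grads = resid_train[:, None] * X_train                  # (n, d) per-example grads
    resid_val = X_val @ theta - y_val
    grad_val = (resid_val[:, None] * X_val).mean(axis=0)    # (d,) clean gradient
    w = np.maximum(0.0, grads @ grad_val)
    total = w.sum()
    return w / total if total > 0 else w

# Two identical inputs, the second with a corrupted label: only the clean
# example's gradient aligns with the validation gradient.
X_train = np.array([[1.0], [1.0]])
y_train = np.array([1.0, -1.0])
X_val, y_val = np.array([[1.0]]), np.array([1.0])
w = example_weights(X_train, y_train, X_val, y_val, theta=np.zeros(1))
```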
https://arxiv.org/abs/2405.06859
The task of medical image recognition is notably complicated by the presence of varied and multiple pathological indications, presenting a unique challenge in multi-label classification with unseen labels. This complexity underlines the need for computer-aided diagnosis methods employing multi-label zero-shot learning. Recent advancements in pre-trained vision-language models (VLMs) have showcased notable zero-shot classification abilities on medical images. However, these methods have limitations in leveraging extensive pre-trained knowledge from broader image datasets, and often depend on manual prompt construction by expert radiologists. By automating the process of prompt tuning, prompt learning techniques have emerged as an efficient way to adapt VLMs to downstream tasks. Yet, existing CoOp-based strategies fall short in performing class-specific prompts on unseen categories, limiting generalizability in fine-grained scenarios. To overcome these constraints, we introduce a novel prompt generation approach inspired by text generation in natural language processing (NLP). Our method, named Pseudo-Prompt Generating (PsPG), capitalizes on the prior knowledge of multi-modal features. Featuring an RNN-based decoder, PsPG autoregressively generates class-tailored embedding vectors, i.e., pseudo-prompts. Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of our approach against leading medical vision-language and multi-label prompt learning methods. The source code is available at this https URL
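The autoregressive decoding idea can be sketched with a tiny vanilla RNN: conditioned on a class-specific feature, it emits a short sequence of embedding vectors that would serve as a soft prompt. All weight matrices are random placeholders (the paper learns them), the dimensions are arbitrary, and the vanilla-RNN cell is an illustrative simplification of the paper's decoder:

```python
import numpy as np

def generate_pseudo_prompt(class_feature, W_h, W_in, W_out, length=4):
    """Toy autoregressive decoder: the hidden state is initialized from a
    class-specific feature, and each step emits one pseudo-prompt embedding
    that is fed back as the next input (hence class-tailored prompts)."""
    h = np.tanh(W_in @ class_feature)        # condition on the class feature
    token = np.zeros_like(class_feature)     # start embedding
    prompt = []
    for _ in range(length):
        h = np.tanh(W_h @ h + W_in @ token)
        token = W_out @ h                    # emitted embedding, fed back next step
        prompt.append(token)
    return np.stack(prompt)                  # (length, dim) pseudo-prompt

rng = np.random.default_rng(0)
D = 8
W_h = rng.standard_normal((D, D)) * 0.1
W_in = rng.standard_normal((D, D)) * 0.1
W_out = rng.standard_normal((D, D)) * 0.1
prompt = generate_pseudo_prompt(rng.standard_normal(D), W_h, W_in, W_out)
```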
https://arxiv.org/abs/2405.06468
Continual Novel Class Discovery (CNCD) aims to continually discover novel classes without labels while maintaining the recognition capability for previously learned classes. The main challenges faced by CNCD include the feature-discrepancy problem, the inter-session confusion problem, etc. In this paper, we propose a novel Feature Enhancement and Adaptation method for the CNCD to tackle the above challenges, which consists of a guide-to-novel framework, a centroid-to-samples similarity constraint (CSS), and a boundary-aware prototype constraint (BAP). More specifically, the guide-to-novel framework is established to continually discover novel classes under the guidance of prior distribution. Afterward, the CSS is designed to constrain the relationship between centroid-to-samples similarities of different classes, thereby enhancing the distinctiveness of features among novel classes. Finally, the BAP is proposed to keep novel class features aware of the positions of other class prototypes during incremental sessions, and better adapt novel class features to the shared feature space. Experimental results on three benchmark datasets demonstrate the superiority of our method, especially in more challenging protocols with more incremental sessions.
https://arxiv.org/abs/2405.06389
The Universal Similarity Metric (USM) has been demonstrated to give practically useful measures of "similarity" between sequence data. Here we have used the USM as an alternative distance metric in a K-Nearest Neighbours (K-NN) learner to allow effective pattern recognition of variable length sequence data. We compare this USM approach with the commonly used string-to-word vector approach. Our experiments have used two data sets of divergent domains: (1) spam email filtering and (2) protein subcellular localization. Our results with this data reveal that the USM-based K-NN learner (1) gives predictions with higher classification accuracy than those output by techniques that use the string-to-word vector approach, and (2) can be used to generate reliable probability forecasts.
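The USM is defined via Kolmogorov complexity and is uncomputable; the standard practical proxy is the Normalized Compression Distance (NCD) with a real compressor. The sketch below uses that proxy with zlib inside a K-NN classifier, so variable-length sequences need no fixed vectorization; the toy spam data is illustrative, not from the paper:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, the usual computable stand-in for
    the USM: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_predict(query: bytes, labeled, k: int = 3) -> str:
    """K-NN over NCD: majority label among the k nearest training sequences."""
    nearest = sorted(labeled, key=lambda item: ncd(query, item[0]))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

train = [(b"free money winner prize " * 5, "spam"),
         (b"cheap pills free offer " * 5, "spam"),
         (b"meeting agenda project notes " * 5, "ham"),
         (b"quarterly report draft review " * 5, "ham")]
pred = knn_predict(b"free prize money offer " * 5, train, k=3)
```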
https://arxiv.org/abs/2405.06301
Automatic speech recognition (ASR) systems, increasingly prevalent in education, healthcare, employment, and mobile technology, face significant challenges in inclusivity, particularly for the 80 million-strong global community of people who stutter. These systems often fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations. This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. The synthetic dataset, uniquely designed to incorporate various stuttering events, enables an in-depth analysis of each ASR's handling of disfluent speech. Our comprehensive assessment includes metrics such as word error rate (WER), character error rate (CER), and semantic accuracy of the transcripts. The results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting in significant syntactical and semantic inaccuracies in transcriptions. These findings highlight a critical gap in current ASR technologies, underscoring the need for effective bias mitigation strategies. Addressing this bias is imperative not only to improve the technology's usability for people who stutter but also to ensure their equitable and inclusive participation in the rapidly evolving digital landscape.
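Since WER is the study's primary metric, here is the textbook dynamic-programming computation (not code from the paper): edit distance over words, normalized by reference length. The example transcript pair is illustrative of a disfluency-induced error:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / #reference words,
    computed via the standard Levenshtein dynamic program over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

# A repeated word (a common disfluency) plus a dropped "to":
# two deletions against a 5-word reference, so WER = 0.4.
wer = word_error_rate("i i want to go", "i want go")
```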
https://arxiv.org/abs/2405.06150
Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use: its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including but not limited to opioids, stimulants and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated data set, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models such as BERT and RoBERTa and with the few-shot learning model DANN, all leveraging the full training dataset, and with GPT-3.5 using one-shot learning, for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.
https://arxiv.org/abs/2405.06145
Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate 'special tokens' in their vocabulary, such as <endoftext>, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's <endoftext> token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall this work demonstrates the vulnerability of Whisper models to 'muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely the attack can also be used to protect private speech data.
https://arxiv.org/abs/2405.06134
Large Language Models (LLMs) have played an important role in many fields due to their powerful capabilities. However, their massive number of parameters leads to high deployment requirements and incurs significant inference costs, which impedes their practical applications. Training smaller models is an effective way to address this problem. Therefore, we introduce OpenBA-V2, a 3.4B model derived from multi-stage compression and continual pre-training from the original 15B OpenBA model. OpenBA-V2 utilizes more data, more flexible training objectives, and techniques such as layer pruning, neural pruning, and vocabulary pruning to achieve a compression rate of 77.3% with minimal performance loss. OpenBA-V2 demonstrates competitive performance compared to other open-source models of similar size, achieving results close to or on par with the 15B OpenBA model in downstream tasks such as common sense reasoning and Named Entity Recognition (NER). OpenBA-V2 illustrates that LLMs can be compressed into smaller ones with minimal performance loss by employing advanced training objectives and data strategies, which may help deploy LLMs in resource-limited scenarios.
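One of the listed compression techniques, vocabulary pruning, can be illustrated with a toy frequency-based cut: keep only the most frequent tokens and map the rest to an unknown entry, which shrinks the embedding and output matrices proportionally. This is purely an illustrative sketch, not the paper's actual procedure:

```python
from collections import Counter

def prune_vocabulary(corpus_tokens, keep: int):
    """Keep the `keep` most frequent tokens and add an <unk> entry for
    everything else; the resulting smaller vocabulary directly reduces
    embedding-table and output-layer parameter counts."""
    freq = Counter(corpus_tokens)
    kept = [tok for tok, _ in freq.most_common(keep)]
    vocab = {tok: i for i, tok in enumerate(kept)}
    vocab["<unk>"] = len(vocab)
    return vocab

toks = ["the", "the", "model", "model", "the", "rare"]
v = prune_vocabulary(toks, keep=2)   # "the" and "model" kept; "rare" -> <unk>
```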
https://arxiv.org/abs/2405.05957
Masked face recognition (MFR) has emerged as a critical domain in biometric identification, especially since the global COVID-19 pandemic made face masks widespread. This survey paper presents a comprehensive analysis of the challenges and advancements in recognising and detecting individuals with masked faces, which has seen innovative shifts due to the necessity of adapting to new societal norms. Advanced through deep learning techniques, MFR, along with Face Mask Recognition (FMR) and Face Unmasking (FU), represent significant areas of focus. These methods address unique challenges posed by obscured facial features, from fully to partially covered faces. Our comprehensive review delves into the various deep learning-based methodologies developed for MFR, FMR, and FU, highlighting their distinctive challenges and the solutions proposed to overcome them. Additionally, we explore benchmark datasets and evaluation metrics specifically tailored for assessing performance in MFR research. The survey also discusses the substantial obstacles still facing researchers in this field and proposes future directions for the ongoing development of more robust and effective masked face recognition systems. This paper serves as an invaluable resource for researchers and practitioners, offering insights into the evolving landscape of face recognition technologies in the face of global health crises and beyond.
https://arxiv.org/abs/2405.05900
This article presents the world's first rapid drone flocking control using natural language through generative AI. The described approach enables the intuitive orchestration of a flock of any size to achieve the desired geometry. The key feature of the method is the development of a new interface based on Large Language Models to communicate with the user and to generate the target geometry descriptions. Users can interactively modify or provide comments during the construction of the flock geometry model. By combining flocking technology and defining the target surface using a signed distance function, smooth and adaptive movement of the drone swarm between target states is achieved. Our user study on FlockGPT confirmed a high level of intuitive control over drone flocking by users. Subjects who had never previously controlled a swarm of drones were able to construct complex figures in just a few iterations and were able to accurately distinguish the formed swarm drone figures. The results revealed a high recognition rate for six different geometric patterns generated through the LLM-based interface and performed by a simulated drone flock (mean of 80% with a maximum of 93% for cube and tetrahedron patterns). Users commented on low temporal demand (19.2 score in NASA-TLX), high performance (26 score in NASA-TLX), attractiveness (1.94 UEQ score), and hedonic quality (1.81 UEQ score) of the developed system. The FlockGPT demo code repository can be found at: coming soon
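The signed-distance-function idea can be sketched with a sphere target: each drone descends the SDF toward the zero level set, which yields the smooth convergence onto the target surface described above. The real system also blends in flocking and separation terms, omitted here; the sphere target and gain are illustrative assumptions:

```python
import numpy as np

def sphere_sdf(p, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return np.linalg.norm(p - center, axis=-1) - radius

def step_toward_surface(positions, center, radius, gain=0.5):
    """Move each drone along the negative SDF gradient, scaled by its signed
    distance, so all drones converge smoothly onto the target surface."""
    d = sphere_sdf(positions, center, radius)[:, None]
    grad = positions - center
    grad /= np.linalg.norm(grad, axis=-1, keepdims=True)   # analytic sphere SDF gradient
    return positions - gain * d * grad

rng = np.random.default_rng(1)
pos = rng.uniform(-3, 3, size=(20, 3))     # 20 drones scattered at random
for _ in range(30):
    pos = step_toward_surface(pos, center=np.zeros(3), radius=1.0)
# After a few dozen steps every drone sits on the unit sphere.
```

Swapping `sphere_sdf` for the SDF of any LLM-generated target surface leaves the update rule unchanged, which is what makes the SDF formulation convenient for arbitrary geometries.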
https://arxiv.org/abs/2405.05872
In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expensive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks.
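The input construction and the dual-target pixel objective can be sketched as follows. The averaging, the horizontal flip as the "inverted view", and the mean-squared reconstruction loss are illustrative assumptions; the paper additionally reconstructs feature-level signals with a learned model:

```python
import numpy as np

def superimpose(image):
    """SSM-style input sketch: add the image to its inverted (here,
    horizontally flipped) view, destroying character shapes and reading
    order; both direction-specific views become reconstruction targets."""
    inverted = image[:, ::-1]
    return (image + inverted) / 2.0, image, inverted

def reconstruction_loss(pred_orig, pred_inv, target_orig, target_inv):
    """Dual-target pixel-level objective: recover both direction-specific
    signals from the single symmetric mixture."""
    return float(np.mean((pred_orig - target_orig) ** 2)
                 + np.mean((pred_inv - target_inv) ** 2))

img = np.arange(12.0).reshape(3, 4)
mixed, t_orig, t_inv = superimpose(img)
# A degenerate "model" that predicts the mixture for both targets still
# incurs loss, since the mixture equals neither direction-specific view.
loss = reconstruction_loss(mixed, mixed, t_orig, t_inv)
```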
https://arxiv.org/abs/2405.05841
Machine learning-based techniques open up many opportunities and improvements to derive deeper and more practical insights from data that can help businesses make informed decisions. However, the majority of these techniques focus on the conventional closed-set scenario, in which the label spaces for the training and test sets are identical. Open set recognition (OSR) aims to bring classification tasks closer to reality, focusing on classifying the known classes as well as handling unknown classes effectively. In such an open-set problem, the samples gathered in the training set cannot encompass all the classes, and the system needs to identify unknown samples at test time. On the other hand, building an accurate and comprehensive model in a real dynamic environment presents a number of obstacles, because it is prohibitively expensive to train for every possible example of unknown items, and the model may fail when deployed in real testbeds. This study provides an algorithm exploring a new representation of feature space to improve classification in OSR tasks. The efficacy and efficiency of business processes and decision-making can be improved by integrating OSR, which offers more precise and insightful predictions of outcomes. We demonstrate the performance of the proposed method on three established datasets. The results indicate that the proposed model outperforms the baseline methods in accuracy and F1-score.
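The known-vs-unknown decision that separates OSR from closed-set classification can be sketched with a minimal centroid-plus-threshold baseline. The paper's contribution is a richer learned feature-space representation; this sketch, with placeholder centroids and threshold, only illustrates the rejection mechanism:

```python
import numpy as np

def osr_predict(x, class_means, threshold):
    """Assign the nearest class centroid in feature space, but reject as
    'unknown' when even the nearest centroid is farther than `threshold`."""
    dists = {label: float(np.linalg.norm(x - mu)) for label, mu in class_means.items()}
    label = min(dists, key=dists.get)
    return label if dists[label] <= threshold else "unknown"

# Toy 2-D feature space with two known classes.
means = {"cat": np.array([0.0, 0.0]), "dog": np.array([5.0, 5.0])}
near = osr_predict(np.array([0.2, -0.1]), means, threshold=1.0)
far = osr_predict(np.array([20.0, -20.0]), means, threshold=1.0)
```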
https://arxiv.org/abs/2405.05836