Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at the image level. Nevertheless, such a criterion may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanded by MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) that complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
Contrastive Language-Image Pre-training (CLIP) is a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. CLIP's success as a vision-language foundation model comes from aligning web-crawled noisy text annotations at the image level. However, such a criterion can become insufficient for downstream tasks that require fine-grained vision representations. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and modules and introduces a new concept, promptable embeddings. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework that effectively generates region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC provides high-quality regional embeddings for image region recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP, especially on referring and grounding tasks.
https://arxiv.org/abs/2410.02746
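To make the region-text objective above concrete, here is a minimal sketch (not the paper's implementation) of how region embeddings could be pooled from a CLIP-style patch grid given box prompts and contrasted against region-caption embeddings. The ROI mean-pooling stands in for CLOC's prompting module, and all tensor shapes and values are illustrative.

```python
import torch
import torch.nn.functional as F

def roi_mean_pool(feature_map, boxes):
    """Mean-pool patch features inside each normalized box.

    feature_map: (H, W, D) grid of patch embeddings from the image encoder.
    boxes: (N, 4) tensor of [x0, y0, x1, y1] in [0, 1] image coordinates.
    Returns (N, D) region embeddings.
    """
    H, W, D = feature_map.shape
    regions = []
    for x0, y0, x1, y1 in boxes.tolist():
        r0, r1 = int(y0 * H), max(int(y1 * H), int(y0 * H) + 1)
        c0, c1 = int(x0 * W), max(int(x1 * W), int(x0 * W) + 1)
        regions.append(feature_map[r0:r1, c0:c1].reshape(-1, D).mean(dim=0))
    return torch.stack(regions)

def region_text_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (region, region-caption) pairs."""
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature
    targets = torch.arange(len(region_emb))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy example: a 16x16 patch grid, 3 annotated regions with pseudo-caption embeddings.
feature_map = torch.randn(16, 16, 512)
boxes = torch.tensor([[0.1, 0.1, 0.4, 0.5],
                      [0.5, 0.2, 0.9, 0.8],
                      [0.0, 0.6, 0.3, 1.0]])
text_emb = torch.randn(3, 512)          # embeddings of the region captions
loss = region_text_contrastive_loss(roi_mean_pool(feature_map, boxes), text_emb)
print(loss.item())
```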
Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A crucial component for successful missions is the incorporation of loop closures through place recognition, which effectively mitigates accumulated pose estimation drift. Despite computational advancements, optimizing performance for real-time deployment remains challenging, especially in resource-constrained mobile robots and multi-robot systems, since conventional keyframe sampling practices in place recognition often result in retaining redundant information or overlooking relevant data, as they rely on fixed sampling intervals or work directly in 3D space instead of the feature space. To address these concerns, we introduce the concept of sample space in place recognition and demonstrate how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition, which focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. This approach is applicable to both learning-based and handcrafted descriptors, and through experimental validation across multiple datasets and descriptor frameworks, we demonstrate the effectiveness of our proposed method, showing that it can jointly minimize redundancy and preserve essential information in real time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A key component of successful missions is incorporating loop closures through place recognition, which effectively mitigates accumulated pose-estimation drift. Despite computational advances, optimizing performance for real-time deployment remains challenging, especially on resource-constrained mobile robots and multi-robot systems, because conventional keyframe sampling practices in place recognition often retain redundant information or overlook relevant data, as they rely on fixed sampling intervals or operate directly in 3D space rather than in the feature space. To address these concerns, we introduce the concept of sample space in place recognition and show how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition that focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. The approach is applicable to both learned and handcrafted descriptors, and through experimental validation on multiple datasets and descriptor frameworks we demonstrate its effectiveness, showing that it can jointly minimize redundancy and preserve essential information in real time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
https://arxiv.org/abs/2410.02643
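As a rough illustration of sampling in descriptor space rather than at fixed intervals, the sketch below greedily keeps a keyframe only when its cosine-normalized descriptor is sufficiently far from every descriptor already retained. This is a simplified stand-in, not the paper's algorithm; the distance threshold and the random descriptors are placeholders.

```python
import numpy as np

def select_keyframes(descriptors, min_dist=0.3):
    """Greedily keep a frame only if its descriptor is at least `min_dist`
    (cosine distance) away from every descriptor already retained,
    i.e. sample in feature space rather than at fixed time/space intervals."""
    descriptors = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    kept = [0]                                   # always keep the first frame
    for i in range(1, len(descriptors)):
        sims = descriptors[kept] @ descriptors[i]
        if 1.0 - sims.max() >= min_dist:         # far enough from all keyframes
            kept.append(i)
    return kept

# Toy stream of 200 place-recognition descriptors (e.g. from a LiDAR scan encoder).
rng = np.random.default_rng(0)
stream = rng.normal(size=(200, 256))
print("kept", len(select_keyframes(stream)), "of", len(stream), "frames")
```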
While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to "see" time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks - specifically fall detection, activity recognition, and readiness assessment - which involve heterogeneous, noisy data and multi-step reasoning. The overall advantage of plot performance over text performance (up to a 120% performance increase on zero-shot synthetic tasks, and up to a 150% performance increase on real-world tasks), across both GPT and Gemini model families, highlights our approach's potential for making the best use of the native capabilities of foundation models.
Although multimodal foundation models can now natively work with data beyond text, they remain underutilized for analyzing the considerable amounts of multi-dimensional time-series data in fields such as healthcare, finance, and the social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to "see" time-series data via plots, avoiding the need for additional, potentially costly model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations reduce model API costs by up to 90%. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional-form identification on clean data to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex real-world scenarios, we apply the approach to consumer health tasks - specifically fall detection, activity recognition, and readiness assessment - which involve heterogeneous, noisy data and multi-step reasoning. The overall advantage of plot performance over text performance (up to a 120% performance increase on zero-shot synthetic tasks and up to a 150% increase on real-world tasks), across both the GPT and Gemini model families, highlights the approach's potential for making the best use of the native capabilities of foundation models.
https://arxiv.org/abs/2410.02637
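The idea of letting a vision encoder "see" a time series can be sketched as: render the series to a PNG, base64-encode it, and attach it to a multimodal chat request alongside a textual question. The snippet below shows only the rendering/encoding half and leaves the actual API call as a comment, since the request format depends on the provider; the series and prompt are illustrative.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")          # headless rendering
import matplotlib.pyplot as plt
import numpy as np

def timeseries_to_image(values, title="sensor trace"):
    """Render a 1-D time series as a PNG and return it base64-encoded,
    ready to be attached to a multimodal chat request as an image part."""
    fig, ax = plt.subplots(figsize=(6, 2.5), dpi=100)
    ax.plot(values)
    ax.set_title(title)
    ax.set_xlabel("time step")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")

# A noisy upward trend the model is asked to describe from the plot rather than raw text.
t = np.linspace(0, 10, 500)
series = 0.5 * t + np.sin(3 * t) + np.random.default_rng(1).normal(0, 0.3, t.size)
image_b64 = timeseries_to_image(series)
prompt = "Here is a plot of a time series. Is the overall trend increasing or decreasing?"
# Send `prompt` plus the base64 PNG to the multimodal model of your choice.
print(len(image_b64), "base64 characters")
```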
For many Automatic Speech Recognition (ASR) tasks, audio features in the form of spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the complex dimensionality of the feature space. The following paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model with a 40-dimensional (300 ms) embedding was then used to generate features for the corpus of spoken commands in the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.
For many automatic speech recognition (ASR) tasks, audio features in the form of spectrograms give better results than Mel-frequency cepstral coefficients (MFCC), but in practice they are hard to use because of the complex dimensionality of the feature space. This paper presents an alternative approach to generating compressed spectrogram representations based on convolutional variational autoencoders (VAE). A convolutional VAE was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. A trained model with a 40-dimensional (300 ms) embedding was then used to generate features for the spoken-command corpus of the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared with a model using MFCC features.
https://arxiv.org/abs/2410.02560
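A minimal convolutional VAE in the spirit described above: it compresses a spectrogram fragment into a 13-dimensional latent and reconstructs it. The input size (1, 64, 8) and the layer widths are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    """Convolutional VAE compressing a (1, 64, 8) spectrogram fragment
    (frequency bins x frames; sizes are illustrative) into a 13-dim embedding."""
    def __init__(self, latent_dim=13):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> (16, 32, 4)
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> (32, 16, 2)
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 2, latent_dim)
        self.fc_logvar = nn.Linear(32 * 16 * 2, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 32 * 16 * 2)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # -> (16, 32, 4)
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),              # -> (1, 64, 8)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.dec(self.fc_dec(z).view(-1, 32, 16, 2))
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar):
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

model = SpectrogramVAE()
batch = torch.randn(8, 1, 64, 8)                 # 8 fake spectrogram fragments
recon, mu, logvar = model(batch)
print(vae_loss(recon, batch, mu, logvar).item(), mu.shape)   # mu is the 13-dim feature
```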
This paper presents an overview of rule-based system for automatic accentuation and phonemic transcription of Russian texts for speech connected tasks, such as Automatic Speech Recognition (ASR). Two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases. Accentuation is based on "Grammatical dictionary of the Russian language" of A.A. Zaliznyak and wiktionary corpus. To distinguish homographs, the accentuation system also utilises morphological information of the sentences based on Recurrent Neural Networks (RNN). Transcription algorithms apply the rules presented in the monograph of B.M. Lobanov and L.I. Tsirulnik "Computer Synthesis and Voice Cloning". The rules described in the present paper are implemented in an open-source module, which can be of use to any scientific study connected to ASR or Speech To Text (STT) tasks. Automatically marked up text annotations of the Russian Voxforge database were used as training data for an acoustic model in CMU Sphinx. The resulting acoustic model was evaluated on cross-validation, mean Word Accuracy being 71.2%. The developed toolkit is written in the Python language and is accessible on GitHub for any researcher interested.
This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts for speech-related tasks such as automatic speech recognition (ASR). The two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases. Accentuation is based on A.A. Zaliznyak's "Grammatical dictionary of the Russian language" and a Wiktionary corpus. To distinguish homographs, the accentuation system also uses morphological information about the sentences obtained with recurrent neural networks (RNN). The transcription algorithms apply the rules presented in the monograph "Computer Synthesis and Voice Cloning" by B.M. Lobanov and L.I. Tsirulnik. The rules described in this paper are implemented in an open-source module that can be useful for any scientific study related to ASR or speech-to-text (STT) tasks. Automatically marked-up text annotations of the Russian Voxforge database were used as training data for an acoustic model in CMU Sphinx. The resulting acoustic model was evaluated with cross-validation, with a mean word accuracy of 71.2%. The developed toolkit is written in Python and is available on GitHub for any interested researcher.
https://arxiv.org/abs/2410.02538
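The dictionary-lookup core of such an accentuation step can be illustrated with a toy example: a stress dictionary maps each word to the index of its stressed vowel, and homographs (words with multiple entries) are exactly the cases that the RNN-based morphological disambiguation would need to resolve. The dictionary entries and the "+" marking convention below are illustrative only, not the module's actual data or format.

```python
# Toy stress dictionary: word -> possible indices (1-based) of the stressed vowel.
STRESS_DICT = {
    "молоко": [3],        # unambiguous: stress on the third vowel
    "замок": [1, 2],      # homograph: зАмок (castle) vs. замОк (lock)
}
VOWELS = "аеёиоуыэюя"

def accentuate(word, vowel_index):
    """Insert a '+' marker after the stressed vowel (one common markup convention)."""
    count, out = 0, []
    for ch in word:
        out.append(ch)
        if ch in VOWELS:
            count += 1
            if count == vowel_index:
                out.append("+")
    return "".join(out)

def accentuate_sentence(words):
    for w in words:
        options = STRESS_DICT.get(w.lower())
        if options is None:
            yield w                          # unknown word: leave unmarked
        elif len(options) == 1:
            yield accentuate(w, options[0])
        else:
            # Homograph: a morphological tagger (an RNN in the paper) would pick the
            # right variant from sentence context; this toy defaults to the first.
            yield accentuate(w, options[0])

print(" ".join(accentuate_sentence(["молоко", "замок"])))
```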
Code-switching (CS) is the process of speakers interchanging between two or more languages, which is becoming increasingly common in the modern world. In order to better describe CS speech, the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language, which is the language that provides the grammatical structure for a CS utterance. In this work, the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), which is a typical way to identify a language in monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases, while also outperforming LID in an MLID recognition task based on F1 macro (60%) and correlation score (0.38). This novel approach identifies that, contrary to the monolingual choice of LID, the non-English languages (Mandarin and Spanish) are preferred over English as the ML.
Code-switching (CS) is the process of speakers alternating between two or more languages, which is becoming increasingly common in the modern world. To better describe CS speech, the Matrix Language Frame (MLF) theory introduces the concept of a matrix language, the language that provides the grammatical structure of a CS utterance. In this work, the MLF theory was used to develop systems for matrix language identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared with acoustic language identity (LID), a typical way of identifying the language of monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases, while also outperforming LID in an MLID recognition task based on F1 macro (60%) and correlation score (0.38). This novel approach finds that, contrary to the monolingual choice of LID, the non-English languages (Mandarin and Spanish) are preferred over English as the matrix language.
https://arxiv.org/abs/2410.02521
Medical image analysis tasks often focus on regions or structures located in a particular part of the patient's body, and large parts of the image may not be of interest for the analysis task. When using deep-learning based approaches, this unnecessarily increases the computational burden during inference and raises the chance of errors. In this paper, we introduce CTARR, a novel generic method for CT Anatomical Region Recognition. The method serves as a pre-processing step for any deep learning-based CT image analysis pipeline by automatically identifying the pre-defined anatomical region that is relevant for the follow-up task and removing the rest. It can be used in (i) image segmentation, to prevent false positives in anatomically implausible regions and to speed up inference, (ii) image classification, to produce image crops that are consistent in their anatomical context, and (iii) image registration, by serving as a fast pre-registration step. Our proposed method is based on atlas registration and provides a fast and robust way to crop any anatomical region, encoded as one or multiple bounding boxes, from any unlabeled CT scan of the brain, chest, abdomen and/or pelvis. We demonstrate the utility and robustness of the proposed method in the context of medical image segmentation by evaluating it on six datasets of public segmentation challenges. The foreground voxels in the regions of interest are preserved in the vast majority of cases and tasks (97.45-100%), while computation takes only fractions of a second (0.1-0.21 s) on a deep learning workstation and greatly reduces the segmentation runtime (2.0-12.7x). Our code is available at this https URL.
Medical image analysis tasks often focus on regions or structures located at a particular location in the patient's body, and large parts of the image may not be of interest for the analysis task. With deep-learning-based approaches, this unnecessarily increases the computational burden during inference and raises the chance of errors. In this paper, we introduce CTARR, a novel generic method for CT anatomical region recognition. The method serves as a pre-processing step for any deep-learning-based CT image analysis pipeline by automatically identifying the pre-defined anatomical region that is relevant for the follow-up task and removing the rest. It can be used for (i) image segmentation, to prevent false positives in anatomically implausible regions and to speed up inference, (ii) image classification, to produce image crops that are consistent in their anatomical context, and (iii) image registration, as a fast pre-registration step. The proposed method is based on atlas registration and provides a fast and robust way to crop any anatomical region, encoded as one or more bounding boxes, from any unlabeled CT scan of the brain, chest, abdomen and/or pelvis. We demonstrate the utility and robustness of the method for medical image segmentation by evaluating it on six datasets from public segmentation challenges. The foreground voxels in the regions of interest are preserved in the vast majority of cases and tasks (97.45-100%), while computation takes only fractions of a second (0.1-0.21 s) on a deep learning workstation and greatly reduces segmentation runtime (2.0-12.7x). Our code is available at this https URL.
https://arxiv.org/abs/2410.02316
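The atlas-based cropping idea can be sketched as follows: given an affine transform from atlas space to image space (assumed to come from a registration tool) and a bounding box pre-defined on the atlas, map the box corners into the scan and crop. The transform, box, and CT volume below are fake placeholders, not the paper's atlas or pipeline.

```python
import numpy as np

def map_atlas_box(box_atlas, affine_atlas_to_image):
    """Map an axis-aligned bounding box (voxel coordinates in the atlas) into the
    target scan using a 4x4 affine obtained from atlas registration, then re-fit
    an axis-aligned box around the transformed corners."""
    lo, hi = np.asarray(box_atlas, dtype=float)
    corners = np.array([[x, y, z, 1.0] for x in (lo[0], hi[0])
                                        for y in (lo[1], hi[1])
                                        for z in (lo[2], hi[2])])
    mapped = (affine_atlas_to_image @ corners.T).T[:, :3]
    return mapped.min(axis=0), mapped.max(axis=0)

def crop_region(volume, box_image):
    """Clip the mapped box to the scan extent and return the cropped sub-volume."""
    lo, hi = box_image
    lo = np.maximum(np.floor(lo).astype(int), 0)
    hi = np.minimum(np.ceil(hi).astype(int), volume.shape)
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

# Fake CT volume and a fake rigid transform (translation only) from registration.
ct = np.zeros((200, 512, 512), dtype=np.int16)
affine = np.eye(4); affine[:3, 3] = [10, -5, 20]
pelvis_box_atlas = [(20, 100, 100), (80, 400, 400)]   # illustrative atlas-space box
roi = crop_region(ct, map_atlas_box(pelvis_box_atlas, affine))
print(roi.shape)
```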
Facial recognition using deep learning has been widely used in social life for applications such as authentication, smart door locks, and photo grouping. More and more networks have been developed to facilitate computer vision tasks, such as ResNet, DenseNet, EfficientNet, ConvNeXt, and Siamese networks. However, few studies have systematically compared the advantages and disadvantages of such neural networks in identifying individuals from images, especially for pet animals like cats. In the present study, by systematically comparing the efficacy of different neural networks in cat recognition, we found that traditional CNNs trained with transfer learning perform better at individual cat recognition than models trained with the fine-tuning method or Siamese networks. In addition, ConvNeXt and DenseNet yield significant results that could be further optimized for individual cat recognition in pet stores and in the wild. These results provide a method to improve cat management in pet stores and the monitoring of cats in the wild.
Facial recognition using deep learning has been widely applied in everyday life, for example for authentication, smart door locks, and photo grouping. Many networks have been developed to support computer vision tasks, such as ResNet, DenseNet, EfficientNet, ConvNeXt, and Siamese networks. However, few studies have systematically compared the advantages and disadvantages of these neural networks for identifying individuals from images, especially for pet animals such as cats. In the present study, by systematically comparing the efficacy of different neural networks for cat recognition, we found that traditional CNNs trained with transfer learning perform better at individual cat recognition than models trained with the fine-tuning method or Siamese networks. In addition, ConvNeXt and DenseNet yield significant results that could be further optimized for individual cat recognition in pet stores and in the wild. These results provide a way to improve cat management in pet stores and the monitoring of cats in the wild.
https://arxiv.org/abs/2410.02305
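A minimal example of the transfer-learning setup the study favors: a pretrained backbone is frozen and only a new classification head is trained for individual cat identities. The backbone choice (ResNet-50), the number of cats, and the hyperparameters are assumptions for illustration, not the study's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CATS = 30   # hypothetical number of individual cats to identify

# Transfer learning: reuse ImageNet features, train only a new classification head.
backbone = models.resnet50(weights="IMAGENET1K_V2")   # downloads pretrained weights
for p in backbone.parameters():
    p.requires_grad = False                            # freeze the feature extractor
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CATS)   # new head (trainable)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a fake batch of 224x224 cat crops.
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, NUM_CATS, (4,))
logits = backbone(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(loss.item())

# Full fine-tuning (the weaker option in the study's comparison) would instead leave
# all parameters trainable, typically with a smaller learning rate.
```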
Integrating artificial intelligence into modern society is profoundly transformative, significantly enhancing productivity by streamlining various daily tasks. AI-driven recognition systems provide notable advantages in the food sector, including improved nutrient tracking, tackling food waste, and boosting food production and consumption efficiency. Accurate food classification is a crucial initial step in utilizing advanced AI models, as the effectiveness of this process directly influences the success of subsequent operations; therefore, achieving high accuracy at a reasonable speed is essential. Despite existing research efforts, a gap persists in improving performance while ensuring rapid processing times, prompting researchers to pursue cost-effective and precise models. This study addresses this gap by employing the state-of-the-art EfficientNetB7 architecture, enhanced through transfer learning, data augmentation, and the CBAM attention module. This methodology results in a robust model that surpasses previous studies in accuracy while maintaining rapid processing suitable for real-world applications. The Food11 dataset from Kaggle was utilized, comprising 16643 imbalanced images across 11 diverse classes with significant intra-category diversities and inter-category similarities. Furthermore, the proposed methodology, bolstered by various deep learning techniques, consistently achieves an impressive average accuracy of 96.40%. Notably, it can classify over 60 images within one second during inference on unseen data, demonstrating its ability to deliver high accuracy promptly. This underscores its potential for practical applications in accurate food classification and enhancing efficiency in subsequent processes.
Integrating artificial intelligence into modern society is profoundly transformative, significantly enhancing productivity by streamlining various daily tasks. AI-driven recognition systems offer notable advantages in the food sector, including improved nutrient tracking, tackling food waste, and boosting the efficiency of food production and consumption. Accurate food classification is a crucial first step in using advanced AI models, since the effectiveness of this step directly influences the success of subsequent operations; achieving high accuracy at a reasonable speed is therefore essential. Despite existing research efforts, a gap remains in improving performance while ensuring rapid processing times, prompting researchers to pursue cost-effective and precise models. This study addresses this gap by employing the state-of-the-art EfficientNetB7 architecture, enhanced through transfer learning, data augmentation, and the CBAM attention module. The methodology yields a robust model that surpasses previous studies in accuracy while maintaining processing speeds suitable for real-world applications. The Food11 dataset from Kaggle was used, comprising 16,643 imbalanced images across 11 diverse classes with significant intra-category diversity and inter-category similarity. The proposed methodology, supported by various deep learning techniques, consistently achieves an impressive average accuracy of 96.40%. Notably, it can classify over 60 images within one second during inference on unseen data, demonstrating its ability to deliver high accuracy promptly. This underscores its potential for practical applications in accurate food classification and for enhancing the efficiency of subsequent processes.
https://arxiv.org/abs/2410.02304
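For readers unfamiliar with the CBAM attention module mentioned above, a compact PyTorch implementation of its standard form (channel attention followed by spatial attention) is sketched below; exactly how it is wired into EfficientNetB7 in the paper is not reproduced here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by
    spatial attention, applied to a feature map of shape (B, C, H, W)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from global average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        spatial_in = torch.cat([x.mean(dim=1, keepdim=True),
                                x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(spatial_in))

# Example: refine backbone features before the classification head.
features = torch.randn(2, 64, 32, 32)
print(CBAM(64)(features).shape)    # torch.Size([2, 64, 32, 32])
```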
In the domain of Aspect-Based Sentiment Analysis (ABSA), generative methods have shown promising results and achieved substantial advancements. However, despite these advancements, the task of extracting sentiment quadruplets, which capture the nuanced sentiment expressions within a sentence, remains a significant challenge. In particular, compound sentences can potentially contain multiple quadruplets, making the extraction task increasingly difficult as sentence complexity grows. To address this issue, we focus on simplifying sentence structures to facilitate the easier recognition of these elements and on crafting a model that integrates seamlessly with various ABSA tasks. In this paper, we propose the Aspect Term Oriented Sentence Splitter (ATOSS), which simplifies compound sentences into simpler and clearer forms, thereby clarifying their structure and intent. As a plug-and-play module, this approach retains the parameters of the ABSA model while making it easier to identify the essential intent within input sentences. Extensive experimental results show that utilizing ATOSS outperforms existing methods in both the ASQP and ACOS tasks, which are the primary tasks for extracting sentiment quadruplets.
In the domain of aspect-based sentiment analysis (ABSA), generative methods have shown promising results and achieved substantial advancements. However, despite these advances, extracting sentiment quadruplets, which capture the nuanced sentiment expressions within a sentence, remains a significant challenge. In particular, compound sentences can contain multiple quadruplets, making the extraction task increasingly difficult as sentence complexity grows. To address this issue, we focus on simplifying sentence structures so that these elements are easier to recognize, and on building a model that integrates seamlessly with various ABSA tasks. In this paper, we propose the Aspect Term Oriented Sentence Splitter (ATOSS), which simplifies compound sentences into simpler, clearer forms, thereby clarifying their structure and intent. As a plug-and-play module, this approach keeps the ABSA model's parameters unchanged while making it easier to identify the essential intent in input sentences. Extensive experimental results show that using ATOSS outperforms existing methods on both ASQP and ACOS, the primary tasks for extracting sentiment quadruplets.
https://arxiv.org/abs/2410.02297
The Novelties corpus is a collection of novels (and parts of novels) annotated for Named Entity Recognition (NER) among other tasks. This document describes the guidelines applied during its annotation. It contains the instructions used by the annotators, as well as a number of examples retrieved from the annotated novels, and illustrating expressions that should be marked as entities as well as expressions that should not.
The Novelties corpus is a collection of novels (and parts of novels) annotated for named entity recognition (NER), among other tasks. This document describes the guidelines applied during its annotation. It contains the instructions used by the annotators, as well as a number of examples retrieved from the annotated novels, illustrating expressions that should be marked as entities and expressions that should not.
https://arxiv.org/abs/2410.02281
Event-based cameras are attracting significant interest as they provide rich edge information, high dynamic range, and high temporal resolution. Many state-of-the-art event-based algorithms rely on splitting the events into fixed groups, resulting in the omission of crucial temporal information, particularly when dealing with diverse motion scenarios (e.g., high/low speed). In this work, we propose SpikeSlicer, a novel plug-and-play event processing method capable of splitting event streams adaptively. SpikeSlicer utilizes a lightweight (0.41M) and low-energy spiking neural network (SNN) to trigger event slicing. To guide the SNN to fire spikes at optimal time steps, we propose the Spiking Position-aware Loss (SPA-Loss) to modulate the neuron's state. Additionally, we develop a Feedback-Update training strategy that refines the slicing decisions using feedback from the downstream artificial neural network (ANN). Extensive experiments demonstrate that our method yields significant performance improvements in event-based object tracking and recognition. Notably, SpikeSlicer provides a brand-new SNN-ANN cooperation paradigm, where the SNN acts as an efficient, low-energy data processor to assist the ANN in improving downstream performance, injecting new perspectives and potential avenues of exploration.
Event-based cameras are attracting significant interest because they provide rich edge information, high dynamic range, and high temporal resolution. Many state-of-the-art event-based algorithms rely on splitting events into fixed groups, which omits crucial temporal information, particularly in diverse motion scenarios (e.g., high/low speed). In this work, we propose SpikeSlicer, a novel plug-and-play event processing method that can split an event stream adaptively. SpikeSlicer uses a lightweight (0.41M) and low-energy spiking neural network (SNN) to trigger event slicing. To guide the SNN to fire spikes at optimal time steps, we propose a Spiking Position-aware Loss (SPA-Loss) to modulate the neuron's state. In addition, we develop a feedback-update training strategy that refines the slicing decisions using feedback from the downstream artificial neural network (ANN). Extensive experiments show that our method yields significant performance improvements in event-based object tracking and recognition. Notably, SpikeSlicer provides a brand-new SNN-ANN cooperation paradigm in which the SNN acts as an efficient, low-energy data processor that assists the ANN in improving downstream performance, injecting new perspectives and potential avenues of exploration.
https://arxiv.org/abs/2410.02249
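To illustrate what adaptive event slicing means in contrast to fixed grouping, the toy sketch below cuts an event stream with a data-dependent trigger. The trigger here is a simple hand-written rule standing in for the paper's SNN; it is not SpikeSlicer, and the synthetic event stream is a placeholder.

```python
import numpy as np

def adaptive_slice(events, trigger, max_len=10_000):
    """Split an event stream of (t, x, y, polarity) tuples into variable-length slices.
    `trigger(buffer)` is a stand-in for the paper's lightweight SNN: it inspects the
    events accumulated so far and decides when to cut, instead of slicing at a fixed
    event count or a fixed time window."""
    slices, buffer = [], []
    for ev in events:
        buffer.append(ev)
        if trigger(buffer) or len(buffer) >= max_len:
            slices.append(np.array(buffer))
            buffer = []
    if buffer:
        slices.append(np.array(buffer))
    return slices

def pause_trigger(buffer, gap_us=1_000):
    """Toy rule: cut a slice whenever the scene goes quiet for more than 1 ms,
    so fast motion yields dense slices and slow motion yields sparse ones."""
    return len(buffer) > 1 and buffer[-1][0] - buffer[-2][0] > gap_us

rng = np.random.default_rng(0)
t = np.cumsum(rng.exponential(300, size=20_000)).astype(np.int64)   # timestamps in us
events = np.stack([t,
                   rng.integers(0, 346, t.size),    # x on a 346x260 event camera
                   rng.integers(0, 260, t.size),    # y
                   rng.integers(0, 2, t.size)], axis=1)              # polarity
print(len(adaptive_slice(list(events), pause_trigger)), "adaptive slices")
```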
Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of using visual prompts for emotion recognition in these models remains largely unexplored and untapped. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face count and emotion categorization while preserving the enriched image context. Through a battery of experimentation and analysis of recent commercial or open-source VLLMs, we evaluate the SoV model's ability to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.
Vision large language models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of using visual prompts for emotion recognition in these models remains largely unexplored and untapped. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face counting and emotion categorization while preserving the enriched image context. Through a battery of experiments and analysis of recent commercial and open-source VLLMs, we evaluate the SoV model's ability to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.
https://arxiv.org/abs/2410.02244
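A minimal sketch of the visual-prompting step: draw numbered boxes and landmark dots onto the image so the text prompt can refer to each face by number. The face detections below are hypothetical placeholders, and the paper's actual prompt design and model calls are not reproduced.

```python
from PIL import Image, ImageDraw

def add_visual_prompts(image, faces):
    """Overlay numbered bounding boxes and facial-landmark dots on the image so a
    vision LLM can be asked about 'face 1', 'face 2', ... in the text prompt."""
    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    for idx, face in enumerate(faces, start=1):
        x0, y0, x1, y1 = face["box"]
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill=(255, 0, 0))
        for lx, ly in face.get("landmarks", []):
            draw.ellipse([lx - 2, ly - 2, lx + 2, ly + 2], fill=(0, 255, 0))
    return canvas

# Hypothetical detections from any off-the-shelf face detector.
faces = [
    {"box": (40, 30, 160, 180), "landmarks": [(80, 90), (120, 90), (100, 140)]},
    {"box": (220, 50, 330, 200)},
]
prompted = add_visual_prompts(Image.new("RGB", (400, 300), "gray"), faces)
prompted.save("sov_prompted.png")
# Text prompt to pair with the image, e.g.:
# "For each numbered face, answer with one emotion label (happy, sad, angry, ...)."
```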
Utterances by L2 speakers can be unintelligible due to mispronunciation and improper prosody. In computer-aided language learning systems, textual feedback is often provided using a speech recognition engine. However, an ideal form of feedback for L2 speakers should be so fine-grained that it enables them to detect and diagnose unintelligible parts of L2 speakers' utterances. Inspired by language teachers who correct students' pronunciation through a voice-to-voice process, this pilot study utilizes a unique semi-parallel dataset composed of non-native speakers' (L2) reading aloud, shadowing of native speakers (L1) and their script-shadowing utterances. We explore the technical possibility of replicating the process of an L1 speaker's shadowing L2 speech using Voice Conversion techniques, to create a virtual shadower system. Experimental results demonstrate the feasibility of the VC system in simulating L1's shadowing behavior. The output of the virtual shadower system shows a reasonable similarity to the real L1 shadowing utterances in both linguistic and acoustic aspects.
Utterances by L2 speakers can be unintelligible due to mispronunciation and improper prosody. In computer-aided language learning systems, textual feedback is often provided using a speech recognition engine. However, an ideal form of feedback for L2 speakers should be fine-grained enough to let them detect and diagnose the unintelligible parts of their utterances. Inspired by language teachers who correct students' pronunciation through a voice-to-voice process, this pilot study uses a unique semi-parallel dataset composed of non-native (L2) speakers' reading aloud, shadowing of native (L1) speakers, and their script-shadowing utterances. We explore the technical possibility of replicating the process of an L1 speaker shadowing L2 speech using voice conversion (VC) techniques, to create a virtual shadower system. Experimental results demonstrate the feasibility of the VC system in simulating L1 shadowing behavior. The output of the virtual shadower system shows reasonable similarity to real L1 shadowing utterances in both linguistic and acoustic aspects.
https://arxiv.org/abs/2410.02239
Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, the metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to strains from as low as 0.005% to as high as 155%, and show stability during extensive use and washing cycles. We use multi-stage machine learning to report average joint angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subject cross-validation, respectively, matching the accuracy of costly motion capture cameras without occlusion or field-of-view limitations. We report a data augmentation technique that enhances robustness to noise and variations of sensors. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues of applications including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language, and object identification.
Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, the metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to strains from as low as 0.005% to as high as 155%, and remain stable during extensive use and washing cycles. Using multi-stage machine learning, we report average joint-angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subject cross-validation, respectively, matching the accuracy of costly motion-capture cameras without occlusion or field-of-view limitations. We report a data augmentation technique that enhances robustness to noise and sensor variations. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues of application, including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language, and object identification.
https://arxiv.org/abs/2410.02221
Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. In addition, Arabic handwriting datasets are smaller than English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script by differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization for historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for the limited amount of historical Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, a 51% improvement over the best baseline in the literature, and attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advances in document digitization, information retrieval, and cultural preservation.
https://arxiv.org/abs/2410.02179
In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable "training-free" classification. Specifically, we utilize various state-of-the-art video encoders to extract features for use in k-nearest neighbors classification, where the training data points are derived from synthetic videos only. We compare these results with another training-free approach -- zero-shot classification using text descriptions of each gesture. In our experiments with the RoCoG-v2 dataset, we find that using synthetic training videos yields significantly lower classification accuracy on real test videos compared to using a relatively small number of real training videos. We also observe that video backbones that were fine-tuned on classification tasks serve as superior feature extractors, and that the choice of fine-tuning data has a substantial impact on k-nearest neighbors performance. Lastly, we find that zero-shot text-based classification performs poorly on the gesture recognition task, as gestures are not easily described through natural language.
In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable "training-free" classification. Specifically, we use various state-of-the-art video encoders to extract features for k-nearest-neighbor classification, where the training data points are derived from synthetic videos only. We compare these results with another training-free approach: zero-shot classification using text descriptions of each gesture. In our experiments with the RoCoG-v2 dataset, we find that using synthetic training videos yields significantly lower classification accuracy on real test videos than using a relatively small number of real training videos. We also observe that video backbones fine-tuned on classification tasks serve as superior feature extractors, and that the choice of fine-tuning data has a substantial impact on k-nearest-neighbor performance. Finally, we find that zero-shot text-based classification performs poorly on the gesture recognition task, as gestures are not easily described in natural language.
https://arxiv.org/abs/2410.02152
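The "training-free" kNN protocol described above reduces to: embed the synthetic support clips and the real query clip with a frozen video encoder, then vote among the nearest neighbors by cosine similarity. The sketch below uses random vectors in place of real encoder features; the feature dimension and class count are illustrative.

```python
from collections import Counter

import numpy as np

def knn_classify(query, bank_feats, bank_labels, k=5):
    """'Training-free' classification: cosine-nearest neighbours in a frozen
    video encoder's feature space, with synthetic clips as the support bank."""
    bank = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    top = np.argsort(-(bank @ q))[:k]                 # k most similar support clips
    return Counter(bank_labels[i] for i in top).most_common(1)[0][0]

# Hypothetical setup: 768-dim clip embeddings from any pretrained video backbone.
rng = np.random.default_rng(0)
bank_feats = rng.normal(size=(500, 768))          # features of synthetic gesture clips
bank_labels = rng.integers(0, 7, size=500)        # 7 gesture classes
query_feat = rng.normal(size=768)                 # feature of a real test clip
print("predicted gesture class:", knn_classify(query_feat, bank_feats, bank_labels))
```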
It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model, but not the base Llama3-8b model, can reliably distinguish its own outputs from those of humans, and we present evidence that the chat model likely uses its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the model's residual stream that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception: steering the model to claim or disclaim authorship by applying the vector to its output as it is generated, and steering the model to believe or disbelieve that it wrote arbitrary texts by applying the vector to them as the model reads them.
https://arxiv.org/abs/2410.02064
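The steering mechanism described in the last sentence can be illustrated generically: register a forward hook that adds a scaled direction to a layer's output hidden states. The toy transformer block, the random vector, and the scale below are placeholders; in the paper the vector is a specific self-authorship direction found in Llama3-8b-Instruct's residual stream.

```python
import torch

def add_steering_hook(layer, vector, alpha=4.0):
    """Register a forward hook that adds `alpha * vector` to the layer's output
    hidden states, nudging the residual stream along the chosen direction."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):                     # e.g. decoder blocks returning tuples
            return (output[0] + alpha * vector,) + output[1:]
        return output + alpha * vector
    return layer.register_forward_hook(hook)

# Toy demonstration on a small transformer block; with a real chat model the hook
# would go on one of its decoder layers and `vector` would be the direction found
# by contrasting activations on correct vs. incorrect self-recognition judgments.
d_model = 64
block = torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
block.eval()                                             # disable dropout for a clean comparison
steering_vector = torch.randn(d_model)
steering_vector /= steering_vector.norm()

x = torch.randn(1, 10, d_model)
baseline = block(x)
handle = add_steering_hook(block, steering_vector)
steered = block(x)
handle.remove()
print("mean activation shift:", (steered - baseline).mean(dim=(0, 1)).norm().item())
```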
Diabetic foot ulcers (DFUs) are a leading cause of hospitalizations and lower limb amputations, placing a substantial burden on patients and healthcare systems. Early detection and accurate classification of DFUs are critical for preventing serious complications, yet many patients experience delays in receiving care due to limited access to specialized services. Telehealth has emerged as a promising solution, improving access to care and reducing the need for in-person visits. The integration of artificial intelligence and pattern recognition into telemedicine has further enhanced DFU management by enabling automatic detection, classification, and monitoring from images. Despite advancements in artificial intelligence-driven approaches for DFU image analysis, the application of large language models for DFU image transcription has not yet been explored. To address this gap, we introduce UlcerGPT, a novel multimodal approach leveraging large language and vision models for DFU image transcription. This framework combines advanced vision and language models, such as Large Language and Vision Assistant and Chat Generative Pre-trained Transformer, to transcribe DFU images by jointly detecting, classifying, and localizing regions of interest. Through detailed experiments on a public dataset, evaluated by expert clinicians, UlcerGPT demonstrates promising results in the accuracy and efficiency of DFU transcription, offering potential support for clinicians in delivering timely care via telemedicine.
Diabetic foot ulcers (DFUs) are a leading cause of hospitalizations and lower-limb amputations, placing a substantial burden on patients and healthcare systems. Early detection and accurate classification of DFUs are critical for preventing serious complications, yet many patients experience delays in receiving care due to limited access to specialized services. Telehealth has emerged as a promising solution, improving access to care and reducing the need for in-person visits. Integrating artificial intelligence and pattern recognition into telemedicine has further enhanced DFU management by enabling automatic detection, classification, and monitoring from images. Despite advances in AI-driven approaches to DFU image analysis, the application of large language models to DFU image transcription has not yet been explored. To address this gap, we introduce UlcerGPT, a novel multimodal approach that leverages large language and vision models for DFU image transcription. The framework combines advanced vision and language models, such as Large Language and Vision Assistant and Chat Generative Pre-trained Transformer, to transcribe DFU images by jointly detecting, classifying, and localizing regions of interest. Through detailed experiments on a public dataset, evaluated by expert clinicians, UlcerGPT demonstrates promising results in the accuracy and efficiency of DFU transcription, offering potential support for clinicians in delivering timely care via telemedicine.
https://arxiv.org/abs/2410.01989
Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model, conditioned on the skeleton modality, to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications on construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models to instruct autonomous construction machines to perform necessary tasks in real-world construction zones. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D120, and NW-UCLA. The results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The codes and dataset are available at: this https URL
Detecting human actions is a crucial task for autonomous robots and vehicles, and it often requires integrating multiple data modalities for improved accuracy. In this study, we introduce a novel approach to human action recognition (HAR) based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder; specifically, we employ learnable prompts for the language model, conditioned on the skeleton modality, to optimize the feature representation. Furthermore, we propose a fusion mechanism that combines the dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to handle the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. We also introduce a new dataset tailored to real-world robotic applications on construction sites, named VolvoConstAct, featuring visual, skeleton, and depth data modalities. This dataset supports the training and evaluation of machine learning models for instructing autonomous construction machines to perform necessary tasks in real-world construction zones. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D120, and NW-UCLA. The results show that the proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The code and dataset are available at: this https URL
https://arxiv.org/abs/2410.01962
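Since the abstract does not spell out the salient fusion module, the sketch below shows one plausible reading under stated assumptions: visual tokens cross-attend to skeleton tokens and a learned saliency gate re-weights the fused tokens before pooling and classification. The dimensions, token counts, and the 60-class head are illustrative, not the paper's design.

```python
import torch
import torch.nn as nn

class SalientFusion(nn.Module):
    """Toy dual-modality fusion: visual tokens attend to skeleton tokens, then a
    learned saliency gate weights the fused tokens before pooling."""
    def __init__(self, dim=256, heads=4, num_classes=60):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)   # e.g. 60 NTU-RGB+D action classes

    def forward(self, visual_tokens, skeleton_tokens):
        # Visual frames query the skeleton joints (cross-modal attention).
        fused, _ = self.cross_attn(visual_tokens, skeleton_tokens, skeleton_tokens)
        weights = self.gate(fused)                       # per-token saliency in [0, 1]
        pooled = (weights * fused).sum(dim=1) / weights.sum(dim=1).clamp(min=1e-6)
        return self.classifier(pooled)

visual = torch.randn(2, 16, 256)     # 16 frame-level visual features per clip
skeleton = torch.randn(2, 25, 256)   # 25 joint-level skeleton features per clip
print(SalientFusion()(visual, skeleton).shape)   # torch.Size([2, 60])
```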