Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) the translation of previous sentences, and (iii) pseudo-glosses transcribing the signing. These cues are automatically extracted and fed, along with the visual features, into a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, as well as to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it to How2Sign, an American Sign Language dataset, where it achieves competitive results.
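To make the input construction concrete, here is a minimal sketch of how the three textual cues and the visual features could be assembled for the LLM; the projector, prompt template, and dimensions are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of assembling contextual inputs for an LLM-based sign
# language translator. Names (VisualProjector, build_prompt) are
# illustrative, not the authors' actual code.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps pre-extracted sign-recognition features into the LLM embedding space."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vis_feats)          # (T, llm_dim)

def build_prompt(background_caption: str, prev_translation: str, pseudo_glosses: str) -> str:
    # The three textual cues from the abstract, concatenated into one prompt.
    return (f"Background: {background_caption}\n"
            f"Previous sentence: {prev_translation}\n"
            f"Pseudo-glosses: {pseudo_glosses}\n"
            f"Translation:")

# Usage: the projected visual tokens would be prepended to the embedded
# prompt tokens before feeding the sequence to the fine-tuned LLM.
vis = torch.randn(64, 768)                    # 64 pre-extracted video feature vectors
vis_tokens = VisualProjector(768, 4096)(vis)  # align with a 4096-d LLM (assumed size)
prompt = build_prompt("A documentary about coastal birds.",
                      "The tide comes in twice a day.",
                      "SEA BIRD FEED MORNING")
```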
https://arxiv.org/abs/2501.09754
The objective of BioCreative8 Track 3 is to extract key phenotypic medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline achieved an exact extraction and normalization F1 score 2.6% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.
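As a rough illustration of synonym marginalization, the sketch below scores a candidate HPO term by averaging the similarity between the extracted mention and each of the term's synonyms; the embedding model and synonym catalog are placeholders, not the authors' exact setup.

```python
# Hedged sketch of synonym marginalization for HPO normalization: the score
# of a candidate HPO term is the average similarity between the extracted
# mention embedding and the embeddings of all synonyms of that term.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def marginalized_score(mention_vec: np.ndarray, synonym_vecs: list[np.ndarray]) -> float:
    # Marginalize (average) over all synonym surface forms of one HPO term.
    return float(np.mean([cosine(mention_vec, s) for s in synonym_vecs]))

def normalize(mention_vec: np.ndarray, hpo_catalog: dict[str, list[np.ndarray]]) -> str:
    # hpo_catalog maps an HPO id to the embeddings of its synonyms.
    return max(hpo_catalog, key=lambda hpo_id: marginalized_score(mention_vec, hpo_catalog[hpo_id]))
```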
https://arxiv.org/abs/2501.09744
Due to privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming increasingly evident. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Code has been released at this https URL.
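A minimal sketch of the group-sparse (group lasso) regularizer over per-layer LoRA modules described above; the module layout and weighting are assumptions rather than the released GS-LoRA implementation.

```python
# Each LoRA module attached to an FFN layer forms one group; a sum of
# per-group L2 norms (group lasso) pushes whole groups toward zero, so only
# a few LoRA groups are selected for each forgetting task.
import torch
import torch.nn as nn

class LoRA(nn.Module):
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A @ self.B       # low-rank update added to the frozen FFN output

def group_sparse_penalty(lora_modules: list[LoRA]) -> torch.Tensor:
    # Sum of per-group L2 norms; groups with small norms are driven to zero,
    # leaving the corresponding FFN layers effectively untouched.
    return sum(torch.norm(torch.cat([m.A.flatten(), m.B.flatten()])) for m in lora_modules)

loras = [LoRA(768) for _ in range(12)]            # e.g. one group per Transformer block
reg = group_sparse_penalty(loras) * 0.01          # added to the forgetting loss (weight assumed)
```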
https://arxiv.org/abs/2501.09705
Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and detection of physical and digital attacks. By leveraging the advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model's robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.
https://arxiv.org/abs/2501.09635
Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at this https URL.
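The sketch below illustrates the two ingredients at a very high level: a few-shot modality alignment (simplified here to a mean offset between paired image and text embeddings) and a decoder trained on text embeddings only, later applied to aligned image embeddings. Both the alignment rule and the decoder interface are assumptions made for illustration, not the paper's exact method.

```python
# (1) Estimate a simple alignment (a mean offset) from a few paired
#     image/text embeddings, and (2) apply a text-trained decoder to
#     aligned image embeddings at test time.
import torch

def fit_mean_offset(img_embs: torch.Tensor, txt_embs: torch.Tensor) -> torch.Tensor:
    # img_embs, txt_embs: (K, d) few-shot paired embeddings.
    return (txt_embs - img_embs).mean(dim=0)           # bridges the modality gap

def align_image_embedding(img_emb: torch.Tensor, offset: torch.Tensor) -> torch.Tensor:
    return img_emb + offset                             # moved toward the text embedding space

# Training: the decoder sees only text embeddings:
#   caption = decoder(txt_emb)
# Inference: the same decoder runs on aligned image embeddings:
#   caption = decoder(align_image_embedding(img_emb, offset))
```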
https://arxiv.org/abs/2501.09555
Code-switching, the alternation of languages within a single discourse, presents a significant challenge for Automatic Speech Recognition. Despite the unique nature of the task, performance is commonly measured with established metrics such as Word-Error-Rate (WER). However, in this paper, we question whether these general metrics accurately assess performance on code-switching. Specifically, using both Connectionist-Temporal-Classification and Encoder-Decoder models, we show that fine-tuning on non-code-switched data from both the matrix and embedded languages improves classical metrics on code-switching test sets, even though performance on the actual code-switched words worsens (as expected). Therefore, we propose Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest. We instantiate PIER on code-switched utterances and show that it describes code-switching performance more accurately, revealing substantial room for improvement in future work. This focused evaluation allows for a more precise assessment of model performance, particularly in challenging aspects such as inter-word and intra-word code-switching.
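To make the metric concrete, here is a simplified sketch of the idea behind PIER: align reference and hypothesis as for WER, then count only errors that touch reference words marked as points of interest (the code-switched words). Insertions not attached to a reference word are ignored in this simplification, so it should be read as an illustration of the concept rather than the authors' exact formulation.

```python
# Simplified PIER-style scoring: standard Levenshtein alignment, but only
# errors on reference words in the interest set are counted.
def align(ref: list[str], hyp: list[str]):
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace to per-reference-word labels ("ok", "sub", "del").
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("sub" if ref[i - 1] != hyp[j - 1] else "ok", i - 1)); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1)); i -= 1
        else:
            j -= 1                       # insertion: no reference word involved in this sketch
    return ops

def pier(ref: list[str], hyp: list[str], interest: set[int]) -> float:
    errs = sum(1 for op, idx in align(ref, hyp) if idx in interest and op != "ok")
    return errs / max(len(interest), 1)

# Example: only the code-switched word at index 2 is scored.
print(pier("ich gehe shopping heute".split(), "ich gehe shoppen heute".split(), {2}))  # 1.0
```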
https://arxiv.org/abs/2501.09512
Understanding emotions accurately is essential for fields like human-computer interaction. Due to the complexity of emotions and their multi-modal nature (e.g., emotions are influenced by facial expressions and audio), researchers have turned to using multi-modal models to understand human emotions rather than single-modality. However, current video multi-modal large language models (MLLMs) encounter difficulties in effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and better generalize to real-world applications. Moreover, in addition to the audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM, enabling the MLLM to effectively unify audio and the subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning in our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
https://arxiv.org/abs/2501.09502
We explore the impact of aperture size and shape on automotive camera systems for deep-learning-based tasks like traffic sign recognition and light state detection. A method is proposed to simulate optical effects using the point spread function (PSF), enhancing realism and reducing the domain gap between synthetic and real-world images. Computer-generated scenes are refined with this technique to model optical distortions and improve simulation accuracy.
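A hedged sketch of the core operation: convolving a rendered frame with a point spread function so synthetic images inherit realistic optical blur. The Gaussian PSF below is only a stand-in for a PSF derived from a specific aperture size and shape.

```python
# Apply a PSF to a rendered image so the synthetic frame picks up optical blur.
import numpy as np
from scipy.signal import fftconvolve

def gaussian_psf(size: int = 15, sigma: float = 2.0) -> np.ndarray:
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return psf / psf.sum()                       # energy-preserving kernel

def apply_psf(image: np.ndarray, psf: np.ndarray) -> np.ndarray:
    # Convolve each color channel independently with the PSF.
    return np.stack([fftconvolve(image[..., c], psf, mode="same")
                     for c in range(image.shape[-1])], axis=-1)

rendered = np.random.rand(256, 256, 3)           # placeholder synthetic frame
blurred = apply_psf(rendered, gaussian_psf())
```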
https://arxiv.org/abs/2501.09456
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $\Delta$CLIP and $\Delta^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $\Delta$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $\Delta^2$LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is this https URL.
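For context, adversarial pre-training of this kind typically generates perturbed images with a projected-gradient-style inner loop. The sketch below shows a generic PGD attack under an L-infinity budget; the loss function, step sizes, and iteration counts are illustrative rather than taken from the paper.

```python
# Generic PGD adversarial example generation (L_inf budget).
import torch

def pgd_attack(model, images, loss_fn, eps=4/255, alpha=1/255, steps=10):
    adv = images.clone().detach()
    adv += torch.empty_like(adv).uniform_(-eps, eps)          # random start
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv))                            # e.g. contrastive loss
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                   # ascend the loss
            adv = images + (adv - images).clamp(-eps, eps)    # project to L_inf ball
            adv = adv.clamp(0, 1)                             # keep a valid image
    return adv.detach()
```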
https://arxiv.org/abs/2501.09446
Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: this https URL.
https://arxiv.org/abs/2501.09436
Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and labor-intensive, posing a significant barrier compared to 2D vision. Recent advances in generative models have achieved unprecedented realism in synthetic data production, and recent research shows the potential for using generated data as training data. This naturally raises the question: can synthetic 3D data generated by generative models be used to expand limited 3D datasets? In response, we present a synthetic 3D dataset expansion method, Text-guided Geometric Augmentation (TeGA). TeGA is tailored for language-image-3D pretraining, which achieves SoTA in zero-shot 3D classification, and uses a generative text-to-3D model to enhance and extend limited 3D datasets. Specifically, we automatically generate text-guided synthetic 3D data and introduce a consistency filtering strategy to discard noisy samples whose semantics and geometric shapes do not match the text. In an experiment doubling the original dataset size using TeGA, our approach demonstrates improvements over the baselines, achieving zero-shot performance gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40. These results demonstrate that TeGA effectively bridges the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision applications.
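A minimal sketch of a consistency filter in the spirit described above: a generated 3D sample is kept only if its embedding is sufficiently similar to the embedding of its text prompt. The encoders and threshold are placeholders; the paper builds on a language-image-3D pretrained model.

```python
# Keep generated shapes whose embedding agrees with the prompt embedding.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def consistency_filter(samples, text_encoder, shape_encoder, threshold=0.25):
    # samples: iterable of (prompt, generated_shape) pairs.
    kept = []
    for prompt, shape in samples:
        t = text_encoder(prompt)              # text embedding of the prompt
        s = shape_encoder(shape)              # embedding of the generated 3D asset
        if cosine(t, s) >= threshold:         # discard semantically mismatched shapes
            kept.append((prompt, shape))
    return kept
```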
https://arxiv.org/abs/2501.09278
Vision-based tactile sensors have drawn increasing interest in the robotics community. However, traditional lens-based designs impose minimum thickness constraints on these sensors, limiting their applicability in space-restricted settings. In this paper, we propose ThinTact, a novel lensless vision-based tactile sensor with a sensing field of over 200 mm² and a thickness of less than 10 mm. It utilizes the mask-based lensless imaging technique to map the contact information to CMOS signals. To ensure real-time tactile sensing, we propose a real-time lensless reconstruction algorithm that leverages a frequency-spatial-domain joint filter based on the discrete cosine transform (DCT). This algorithm achieves computation significantly faster than existing optimization-based methods. Additionally, to improve the sensing quality, we develop a mask optimization method based on the generic algorithm and the corresponding system matrix calibration. We evaluate the performance of our proposed lensless reconstruction and tactile sensing through qualitative and quantitative experiments. Furthermore, we demonstrate ThinTact's practical applicability in diverse applications, including texture recognition and contact-rich object manipulation. The paper will appear in the IEEE Transactions on Robotics: this https URL. Video: this https URL
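As a toy illustration of working in the DCT domain, the snippet below applies a simple DCT-domain low-pass filter to a grayscale lensless measurement. The paper's actual method is a frequency-spatial joint filter tied to the calibrated system matrix, which this sketch does not model.

```python
# Toy DCT-domain filtering of a 2-D lensless measurement.
import numpy as np
from scipy.fft import dctn, idctn

def dct_lowpass(measurement: np.ndarray, keep: float = 0.25) -> np.ndarray:
    coeffs = dctn(measurement, norm="ortho")
    h, w = coeffs.shape
    mask = np.zeros_like(coeffs)
    mask[: int(h * keep), : int(w * keep)] = 1.0   # retain low-frequency coefficients only
    return idctn(coeffs * mask, norm="ortho")

filtered = dct_lowpass(np.random.rand(128, 128))
```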
https://arxiv.org/abs/2501.09273
This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). Although shallow fusion is the most common approach to incorporate language models into E2E-ASR decoding, we face two practical problems with LLMs. (1) LLM inference is computationally costly. (2) There may be a vocabulary mismatch between the ASR model and the LLM. To resolve this mismatch, we need to retrain the ASR model and/or the LLM, which is at best time-consuming and in many cases not feasible. We propose "delayed fusion," which applies LLM scores to ASR hypotheses with a delay during decoding and enables easier use of pre-trained LLMs in ASR tasks. This method can reduce not only the number of hypotheses scored by the LLM but also the number of LLM inference calls. It also allows re-tokenization of ASR hypotheses during decoding if ASR and LLM employ different tokenizations. We demonstrate that delayed fusion provides improved decoding speed and accuracy compared to shallow fusion and N-best rescoring using the LibriHeavy ASR corpus and three public LLMs, OpenLLaMA 3B & 7B and Mistral 7B.
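A simplified sketch of the delayed-fusion idea: during beam search, LLM scores are added to partial hypotheses only every few decoding steps, and only for hypotheses that survive ASR pruning, which cuts both the number of scored hypotheses and the number of LLM calls. The scoring callables and beam bookkeeping below are placeholders, not the paper's implementation.

```python
# Delayed application of LLM scores during beam-search decoding.
def delayed_fusion_decode(asr_step, llm_score, beam, steps, delay=4, lm_weight=0.5, beam_size=8):
    # beam: list of (tokens, asr_score, llm_score_so_far), e.g. [([], 0.0, 0.0)]
    # asr_step(tokens) -> iterable of (next_token, log_prob); llm_score(tokens) -> float
    for t in range(steps):
        expanded = []
        for tokens, asr_s, llm_s in beam:
            for tok, s in asr_step(tokens):                       # ASR expansions
                expanded.append((tokens + [tok], asr_s + s, llm_s))
        # Prune using ASR score plus the most recently applied LLM score.
        expanded.sort(key=lambda h: h[1] + lm_weight * h[2], reverse=True)
        beam = expanded[:beam_size]
        if (t + 1) % delay == 0:                                  # delayed LLM scoring
            beam = [(tok, a, llm_score(tok)) for tok, a, _ in beam]
    return max(beam, key=lambda h: h[1] + lm_weight * h[2])
```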
https://arxiv.org/abs/2501.09258
Vision Transformers (ViTs) are increasingly being adopted in various sensitive vision applications - like medical diagnosis, facial recognition, etc. To improve the interpretability of such models, many approaches attempt to forward-align them with carefully annotated abstract, human-understandable semantic entities - concepts. Concepts provide global rationales to the model predictions and can be quickly understood/intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper, we propose a novel Concept Representation Alignment Module (CRAM) which learns both scale and position-aware representations from multi-scale feature pyramids and patch representations respectively. CRAM further aligns these representations with concept annotations through an attention matrix. The proposed CRAM module improves the predictive performance of ViT architectures and also provides accurate and robust concept explanations as demonstrated on five datasets - including three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and two real-world datasets (AWA2, KITS).
https://arxiv.org/abs/2501.09221
Data augmentation (DA) is ubiquitously used in training Automatic Speech Recognition (ASR) models. DA offers increased data variability, robustness and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data and used to personalize ASR. persoDA aims to augment training with data specifically tuned towards the acoustic characteristics of the end user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noise. Our evaluation with an ASR conformer-based baseline trained on Librispeech and personalized for VOICES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA shows 16% to 20% faster convergence than MCT.
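A hedged sketch of user-driven augmentation in the spirit of persoDA: noise segments taken from the end user's own recordings are mixed into training utterances at a sampled SNR, instead of the random noise and reverberation of standard MCT. The noise-extraction step and SNR range are assumptions for illustration.

```python
# Mix user-derived noise into a training utterance at a target SNR.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)                 # tile/crop noise to utterance length
    p_s = np.mean(speech ** 2) + 1e-12
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

def perso_augment(speech: np.ndarray, user_noise_segments: list, rng=np.random.default_rng(0)):
    # user_noise_segments: noise clips extracted from the end user's recordings.
    noise = user_noise_segments[rng.integers(len(user_noise_segments))]
    return mix_at_snr(speech, noise, snr_db=rng.uniform(5, 20))
```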
https://arxiv.org/abs/2501.09113
In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
https://arxiv.org/abs/2501.09104
The current biodiversity loss crisis makes animal monitoring a relevant field of study. In light of this, data collected through monitoring can provide essential insights and information for decision-making aimed at preserving global biodiversity. Despite the importance of such data, there is a notable scarcity of datasets featuring videos of birds, and none of the existing datasets offer detailed annotations of bird behaviors in video format. In response to this gap, our study introduces the first fine-grained video dataset specifically designed for bird behavior detection and species classification. This dataset addresses the need for comprehensive bird video datasets and provides detailed data on bird actions, facilitating the development of deep learning models to recognize them, similar to the advancements made in human action recognition. The proposed dataset comprises 178 videos recorded in Spanish wetlands, capturing 13 different bird species performing 7 distinct behavior classes. In addition, we also present baseline results using state-of-the-art models on two tasks: bird behavior recognition and species classification.
https://arxiv.org/abs/2501.08931
Facial brightness is a key image quality factor impacting face recognition accuracy differentials across demographic groups. In this work, we aim to decrease the accuracy gap between the similarity score distributions for Caucasian and African American female mated image pairs, as measured by d' between distributions. To balance brightness across demographic groups, we conduct three experiments, interpreting brightness in the face skin region either as median pixel value or as the distribution of pixel values. Balancing based on median brightness alone yields up to a 46.8% decrease in d', while balancing based on brightness distribution yields up to a 57.6% decrease. In all three cases, the similarity scores of the individual distributions improve, with mean scores maximally improving 5.9% for Caucasian females and 3.7% for African American females.
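For reference, the d' statistic used above to measure the separation between two similarity-score distributions is commonly defined from their means and standard deviations as follows (this is the standard definition, assumed here to match the paper's usage):

$$ d' = \frac{|\mu_1 - \mu_2|}{\sqrt{\tfrac{1}{2}\left(\sigma_1^{2} + \sigma_2^{2}\right)}} $$

Under this reading, reducing d' corresponds to bringing the Caucasian and African American mated-pair score distributions closer together.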
https://arxiv.org/abs/2501.08910
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it is comparable to, or even exceeds, state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at this https URL.
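A rough, training-free sketch in the spirit of IDEA: a test image is scored by fusing its similarity to cached few-shot visual features with its similarity to text embeddings of the image descriptions. The fusion weight and cache layout are illustrative assumptions, not the paper's exact design.

```python
# Training-free fusion of image-image and image-description similarities.
import torch

def idea_like_logits(img_feat, cache_feats, cache_labels, text_feats, num_classes, alpha=0.5):
    # img_feat: (d,), cache_feats: (N, d), text_feats: (C, d); all L2-normalized.
    # cache_labels: (N,) long tensor of class ids for the cached few-shot features.
    vis_sim = img_feat @ cache_feats.T                                  # (N,) image-image similarity
    vis_logits = torch.zeros(num_classes).index_add_(0, cache_labels, vis_sim)
    txt_logits = img_feat @ text_feats.T                                # (C,) image-description similarity
    return alpha * vis_logits + (1 - alpha) * txt_logits
```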
https://arxiv.org/abs/2501.08816
In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage of XMusic is this https URL.
https://arxiv.org/abs/2501.08809