Although noise and caption quality have been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we first study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. We then devise effective solutions to both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with a sigmoid loss to address this requirement. We show very large gains over the current state of the art for both image recognition ($\sim +6\%$ on average over 11 datasets) and image retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).
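To make the requirement concrete, below is a minimal sketch of a pairwise sigmoid contrastive loss that naturally accommodates multiple true positives per image, in the spirit of the sigmoid training the abstract proposes; the function name, scaling constants, and mask convention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, pos_mask, t=10.0, b=-10.0):
    """img_emb: (N, D) and txt_emb: (M, D), both L2-normalized.
    pos_mask: (N, M) bool, True where pair (i, j) is a true positive;
    any number of positives per image is allowed."""
    logits = img_emb @ txt_emb.t() * t + b      # scaled pairwise similarities
    labels = pos_mask.float() * 2.0 - 1.0       # +1 for positives, -1 for negatives
    # each pair is an independent binary decision, so extra positives
    # do not compete with one another as they would under a softmax
    return -F.logsigmoid(labels * logits).mean()
```

Because every image-text pair is scored independently, marking several captions as positives for one image needs no change to the loss, unlike the row-normalized InfoNCE objective.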
https://arxiv.org/abs/2405.10286
In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR) and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new, manually collected dataset annotations. These provide continuous sign-level annotations for six hours of test videos and will be made publicly available. We demonstrate that, with a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance -- retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
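As a rough illustration of the multi-task setup, the sketch below pairs a sentence-level retrieval embedding with per-frame sign logits from a shared Transformer encoder; all module names, sizes, and the pooling choice are assumptions rather than the CSLR2 architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSignModel(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=8000, proj_dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.sign_head = nn.Linear(feat_dim, vocab_size)  # per-frame sign logits (CSLR)
        self.proj = nn.Linear(feat_dim, proj_dim)         # joint video-text retrieval space

    def forward(self, video_feats):                       # video_feats: (B, T, feat_dim)
        h = self.encoder(video_feats)
        sign_logits = self.sign_head(h)                   # fine-grained, frame-level supervision
        clip_emb = F.normalize(self.proj(h.mean(dim=1)), dim=-1)  # pooled retrieval embedding
        return sign_logits, clip_emb

# training would combine a contrastive retrieval loss on clip_emb with a
# sign-classification loss on sign_logits -- the mutually beneficial pairing
# described in the abstract
```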
https://arxiv.org/abs/2405.10266
Remote sensing image-text retrieval constitutes a foundational aspect of remote sensing interpretation tasks, facilitating the alignment of vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Based on PIR, a domain-adapted remote sensing image-text retrieval framework, PIR-ITR, is designed to address semantic noise issues in vision-language understanding tasks. However, with the massive additional data used to pre-train vision-language foundation models, remote sensing image-text retrieval has further developed into an open-domain retrieval task. Building on the above, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote sensing image-text retrieval, to address semantic noise in remote sensing vision-language representations and further improve open-domain retrieval performance. In vision representation, Vision Instruction Representation (VIR), based on Spatial-PAE, exploits prior knowledge of remote sensing scene recognition by building a belief matrix to select key features and reduce the impact of semantic noise. In text representation, Language Cycle Attention (LCA), based on Temporal-PAE, uses the previous time step to cyclically activate the current time step and enhance text representation capability. A cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class relations and reduce semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that PIR can enhance vision and text representations and outperform state-of-the-art methods in closed-domain and open-domain retrieval on two benchmark datasets, RSICD and RSITMD.
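The abstract does not spell out the exact form of the Affiliation Loss; as a hedged illustration only, the sketch below shows a common center-based formulation that pulls embeddings toward their class center while pushing different class centers apart, which matches the stated goal of constraining inter-class relations in the common subspace.

```python
import torch
import torch.nn.functional as F

def affiliation_style_loss(emb, labels, centers, margin=0.3):
    """emb: (N, D) L2-normalized embeddings; labels: (N,) class ids;
    centers: (C, D) learnable class centers. Illustrative stand-in, not the paper's AL."""
    centers = F.normalize(centers, dim=-1)
    pull = (1.0 - (emb * centers[labels]).sum(dim=-1)).mean()  # tighten each cluster
    sim = centers @ centers.t()                                # inter-center similarity
    off_diag = ~torch.eye(len(centers), dtype=torch.bool, device=sim.device)
    push = F.relu(sim[off_diag] - margin).mean()               # separate class centers
    return pull + push
```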
https://arxiv.org/abs/2405.10160
Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
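A minimal sketch of the fine-to-coarse idea: one shared feature extractor with a separate projection head per hierarchy level, each trained with its own contrastive objective so the embedding spaces stay disentangled. The level names, dimensions, and symmetric InfoNCE loss are assumptions, not the HecVL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(v, t, temp=0.07):
    """Symmetric InfoNCE over matched (B, D) video and text features."""
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.t() / temp
    targets = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

class HierarchicalHeads(nn.Module):
    def __init__(self, dim=768, proj=256):
        super().__init__()
        # one head per text granularity keeps the three spaces disentangled
        self.heads = nn.ModuleDict({
            level: nn.Linear(dim, proj) for level in ("clip", "phase", "video")
        })

    def loss(self, video_feats, text_feats):
        # video_feats / text_feats: dicts mapping level -> (B, dim) features
        return sum(info_nce(self.heads[l](video_feats[l]),
                            self.heads[l](text_feats[l])) for l in self.heads)
```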
https://arxiv.org/abs/2405.10075
Event Stream Super-Resolution (ESR) aims to address the challenge of insufficient spatial resolution in event streams, which holds great significance for the application of event cameras in complex scenarios. Previous works on ESR often process positive and negative events in a mixed paradigm. This paradigm limits their ability to model the unique characteristics of each event type and to let the two types mutually refine each other through their correlations. In this paper, we propose a bilateral event mining and complementary network (BMCNet) to fully leverage the potential of each event type while simultaneously capturing the shared information through which the two complement each other. Specifically, we resort to a two-stream network to accomplish comprehensive mining of each type of event individually. To facilitate the exchange of information between the two streams, we propose a bilateral information exchange (BIE) module. This module is embedded layer-wise between the two streams, enabling the effective propagation of hierarchical global information while alleviating the impact of invalid information brought by the inherent characteristics of events. The experimental results demonstrate that our approach outperforms the previous state-of-the-art methods in ESR, achieving performance improvements of over 11\% on both real and synthetic datasets. Moreover, our method significantly enhances the performance of event-based downstream tasks such as object recognition and video reconstruction. Our code is available at this https URL.
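As a hedged sketch of the bilateral exchange, the block below lets each event stream keep its own features while receiving a gated summary of the other stream; the gating design is an assumption, not the published BIE module.

```python
import torch
import torch.nn as nn

class BilateralExchange(nn.Module):
    """Toy exchange block between positive- and negative-event streams."""
    def __init__(self, channels):
        super().__init__()
        self.gate_p = nn.Conv2d(channels * 2, channels, kernel_size=1)
        self.gate_n = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, feat_pos, feat_neg):          # (B, C, H, W) each
        joint = torch.cat([feat_pos, feat_neg], dim=1)
        # each stream keeps its own features and receives a gated view of the
        # other stream; the sigmoid gate can suppress invalid (noisy) responses
        feat_pos = feat_pos + torch.sigmoid(self.gate_p(joint)) * feat_neg
        feat_neg = feat_neg + torch.sigmoid(self.gate_n(joint)) * feat_pos
        return feat_pos, feat_neg
```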
https://arxiv.org/abs/2405.10037
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content; 2) N-best hypotheses usually vary in only a few tokens, making it redundant to send all of them for GER, which could confuse the LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) that receives the source speech as extra input to improve the fidelity of the correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
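The cloze reformatting can be illustrated without any model: tokens on which all N-best hypotheses agree are kept, and diverging positions become blanks with candidate options. The toy function below assumes equal-length whitespace tokenization for brevity; a real system would align hypotheses with edit distance, and this is not the paper's exact procedure.

```python
def to_cloze(hypotheses):
    """Turn N-best ASR hypotheses into a cloze template plus candidate options."""
    token_lists = [h.split() for h in hypotheses]
    assert len({len(t) for t in token_lists}) == 1, "sketch assumes equal lengths"
    cloze, options = [], []
    for position in zip(*token_lists):
        if len(set(position)) == 1:
            cloze.append(position[0])          # all hypotheses agree: keep token
        else:
            cloze.append("___")                # disagreement: blank to be filled
            options.append(sorted(set(position)))
    return " ".join(cloze), options

print(to_cloze(["i saw a cat today", "i saw a hat today", "i saw a cat today"]))
# -> ('i saw a ___ today', [['cat', 'hat']])
```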
https://arxiv.org/abs/2405.10025
Most existing attention prediction research focuses on salient instances like humans and objects. However, the more complex interaction-oriented attention, arising from human observers' comprehension of interactions between instances, remains largely unexplored. This is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap, we first collect a novel gaze fixation dataset named IG, comprising 530,000 fixation points across 740 diverse interaction categories, capturing visual attention during human observers' cognitive processing of interactions. Subsequently, we introduce the zero-shot interaction-oriented attention prediction task, ZeroIA, which challenges models to predict visual cues for interactions not encountered during training. Thirdly, we present the Interactive Attention model, IA, designed to emulate human observers' cognitive processes to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both the ZeroIA and fully supervised settings. Lastly, we endeavor to apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential of incorporating real human attention data from IG and attention labels generated by IA to enhance the performance and interpretability of existing state-of-the-art HOI models.
https://arxiv.org/abs/2405.09931
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with and without reference ground truth across various scenarios from real applications. Participants were required to restore real-captured images from complex and unknown degradation, where both generative perceptual quality and fidelity are desired in the restoration result. The challenge consisted of two tasks. Task one employed real referenced data pairs, where quantitative evaluation is available. Task two used unpaired images, and a comprehensive user study was conducted. The challenge attracted more than 200 registrations, of which 39 submitted results in more than 400 submissions. The top-ranked methods improved the state-of-the-art restoration performance and received unanimous recognition from all 18 judges. The proposed datasets are available at this https URL and the homepage of this challenge is at this https URL.
https://arxiv.org/abs/2405.09923
With the rapid development of face recognition (FR) systems, the privacy of face images on social media faces severe challenges due to the abuse of unauthorized FR systems. Some studies use adversarial attack techniques to defend against malicious FR systems by generating adversarial examples. However, the generated adversarial examples, i.e., the protected face images, tend to suffer from subpar visual quality and low transferability. In this paper, we propose a novel face protection approach, dubbed DiffAM, which leverages the powerful generative ability of diffusion models to generate high-quality protected face images with adversarial makeup transferred from reference images. To be specific, we first introduce a makeup removal module that generates non-makeup images using a fine-tuned diffusion model guided by textual prompts in CLIP space. As the inverse process of makeup transfer, makeup removal makes it easier to establish a deterministic relationship between the makeup and non-makeup domains regardless of elaborate text prompts. Then, with this relationship, a CLIP-based makeup loss together with an ensemble attack strategy is introduced to jointly guide the direction of the adversarial makeup domain, achieving the generation of protected face images with natural-looking makeup and high black-box transferability. Extensive experiments demonstrate that DiffAM achieves higher visual quality and attack success rates, with a gain of 12.98% under the black-box setting, compared with the state of the art. The code will be available at this https URL.
https://arxiv.org/abs/2405.09882
3D face registration is an important process in which a 3D face model is aligned and mapped to a template face. However, the task becomes particularly challenging when dealing with partial face data, where only limited facial information is available. To address this challenge, this paper presents a novel deep learning-based approach that combines quasi-conformal geometry with deep neural networks for partial face registration. The proposed framework begins with a Landmark Detection Network that utilizes curvature information to detect the presence of facial features and estimate their corresponding coordinates. These facial landmarks serve as essential guidance for the registration process. To establish a dense correspondence between the partial face and the template surface, a registration network based on quasi-conformal theories is employed. The registration network establishes a bijective quasi-conformal surface mapping that aligns corresponding partial faces based on the detected landmarks and curvature values. It consists of a Coefficients Prediction Network, which outputs the optimal Beltrami coefficient representing the surface mapping. The Beltrami coefficient quantifies the local geometric distortion of the mapping. By controlling the magnitude of the Beltrami coefficient through a suitable activation function, the bijectivity and geometric distortion of the mapping can be controlled. The Beltrami coefficient is then fed into a Beltrami solver network to reconstruct the corresponding mapping. The surface registration yields corresponding regions and a point-wise correspondence between different partial faces, facilitating precise shape comparison through the evaluation of point-wise geometric differences at these corresponding regions. Experimental results demonstrate the effectiveness of the proposed method.
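The key mathematical constraint here is that a quasi-conformal map is bijective when its Beltrami coefficient satisfies $|\mu| < 1$ everywhere, which is why a suitable activation on the predicted coefficient controls bijectivity. A minimal sketch, with an assumed bound and squashing function (the paper's exact activation is not specified in the abstract):

```python
import torch

def bounded_beltrami(raw_real, raw_imag, bound=0.99):
    """Squash an unconstrained 2-channel prediction into |mu| <= bound < 1."""
    mu = torch.complex(raw_real, raw_imag)
    mag = torch.abs(mu)
    # shrink the magnitude with tanh while preserving the phase of mu
    scale = bound * torch.tanh(mag) / (mag + 1e-8)
    return mu * scale

mu = bounded_beltrami(torch.randn(4, 64, 64), torch.randn(4, 64, 64))
assert torch.abs(mu).max() < 1.0   # the bijectivity condition on the distortion
```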
https://arxiv.org/abs/2405.09880
Spoken language interaction is at the heart of interpersonal communication, and people flexibly adapt their speech to different individuals and environments. It is surprising that robots, and by extension other digital devices, are not equipped to adapt their speech and instead rely on fixed speech parameters, which often hinder comprehension by the user. We conducted a speech comprehension study involving 39 participants who were exposed to different environmental and contextual conditions. During the experiment, the robot articulated words using different vocal parameters, and the participants were tasked with both recognising the spoken words and rating their subjective impression of the robot's speech. The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience. However, increasing the distance between the user and the robot degraded the user experience, while distracting background sounds significantly reduced speech recognition accuracy and user satisfaction. We next built an adaptive voice for the robot. For this, the robot needs to know how difficult it is for a user to understand spoken language in a particular setting. We present a prediction model that rates how annoying the ambient acoustic environment is and, consequently, how hard it is to understand someone in this setting. Then, we develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces, while taking into account the influence of ambient acoustics on intelligibility. Finally, we present an evaluation with 27 users, demonstrating superior intelligibility and user experience with adaptive voice parameters compared to a fixed voice.
https://arxiv.org/abs/2405.09708
We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
https://arxiv.org/abs/2405.09496
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have illustrated that surreptitiously crafted adversarial perturbations enable the manipulation of speech recognition systems, resulting in the production of malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesis audio. However, style modifications based on optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of a Style Transfer Attack (STA), which combines style transfer and adversarial attack in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve an attack success rate of 82%, while preserving sound naturalness, as confirmed by our user study.
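As a hedged sketch of an iterative attack on a style code, the loop below takes gradient steps on a TTS style vector so that the synthesized audio drives the ASR loss toward an adversarial target, while a proximity term keeps the result close to the user-chosen style; `tts` and `asr_loss` are hypothetical stand-ins, and the update rule is an assumption rather than the published SCA.

```python
import torch

def style_code_attack(style, tts, asr_loss, target, steps=100, lr=0.01, lam=0.1):
    """style: initial style code tensor; tts: style -> waveform (differentiable);
    asr_loss: (waveform, target transcript) -> scalar loss on the victim ASR."""
    style = style.clone().detach().requires_grad_(True)
    anchor = style.detach().clone()                 # user-customized style to stay near
    opt = torch.optim.Adam([style], lr=lr)
    for _ in range(steps):
        audio = tts(style)                          # synthesize with current style code
        loss = asr_loss(audio, target) + lam * (style - anchor).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return style.detach()
```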
https://arxiv.org/abs/2405.09470
A fundamental tenet of pattern recognition is that overlap between training and testing sets causes an optimistic accuracy estimate. Deep CNNs for face recognition are trained for N-way classification of the identities in the training set. Accuracy is commonly estimated as average 10-fold classification accuracy on image pairs from test sets such as LFW, CALFW, CPLFW, CFP-FP and AgeDB-30. Because train and test sets have been independently assembled, images and identities in any given test set may also be present in any given training set. In particular, our experiments reveal a surprising degree of identity and image overlap between the LFW family of test sets and the MS1MV2 training set. Our experiments also reveal identity label noise in MS1MV2. We compare accuracy achieved with same-size MS1MV2 subsets that are identity-disjoint and not identity-disjoint with LFW, to reveal the size of the optimistic bias. Using more challenging test sets from the LFW family, we find that the size of the optimistic bias is larger for more challenging test sets. Our results highlight the lack of and the need for identity-disjoint train and test methodology in face recognition research.
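The identity-disjoint methodology the paper advocates reduces to simple set hygiene over identity labels; a minimal sketch, with assumed data structures:

```python
def identity_disjoint_subset(train_items, test_identities, target_size):
    """train_items: list of (image_path, identity) pairs;
    returns a same-size training subset sharing no identities with the test set."""
    train_ids = {ident for _, ident in train_items}
    overlap = train_ids & set(test_identities)
    print(f"overlapping identities: {len(overlap)}")
    clean = [item for item in train_items if item[1] not in overlap]
    assert len(clean) >= target_size, "not enough disjoint data at this size"
    return clean[:target_size]   # same-size subset for a fair accuracy comparison
```

Comparing accuracy from models trained on the disjoint and non-disjoint subsets of equal size isolates the optimistic bias the paper measures.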
https://arxiv.org/abs/2405.09403
Synthetic aperture radar (SAR) is essential in actively acquiring information for Earth observation. SAR Automatic Target Recognition (ATR) focuses on detecting and classifying various target categories under different image conditions. Current deep learning-based SAR ATR methods are typically designed for specific datasets and applications. Varying target characteristics, scene background information, and sensor parameters across ATR datasets challenge the generalization of those methods. This paper aims to achieve general SAR ATR based on a foundation model with Self-Supervised Learning (SSL). Our motivation is to break through the limitations of specific datasets and conditions and obtain universal perceptual capabilities across targets, scenes, and sensors. A foundation model named SARATR-X is proposed, covering four aspects: pre-training dataset, model backbone, SSL, and evaluation tasks. First, we integrated 14 datasets with various target categories and imaging conditions into a pre-training dataset. Second, we discussed different model backbones to find the most suitable approaches for remote-sensing images. Third, we applied two-stage training and SAR gradient features to ensure the diversity and scalability of SARATR-X. Finally, SARATR-X achieved competitive and superior performance on 5 datasets with 8 task settings, showing that a foundation model can achieve universal SAR ATR. We believe it is time to embrace foundation models for SAR image interpretation in the era of increasing big data.
https://arxiv.org/abs/2405.09365
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as device-related difficulties such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections, together with the weights of the trained YOLOv7 model, is available at: this https URL.
https://arxiv.org/abs/2405.09355
Graph neural networks have proven to be an efficient machine learning technique in real-life applications. Handwriting recognition is one such useful area, where both offline and online handwriting recognition are required. The chain code as a feature extraction technique has shown significant results in the literature, and we have been able to use chain codes with graph neural networks. To the best of our knowledge, this work presents for the first time a novel combination of handwritten trajectory features as chain codes together with graph neural networks. Handwritten trajectories for offline handwritten text are evaluated via recovery of the drawing order, whereas online handwritten trajectories are used directly with chain codes. Our results show that the proposed combination surpasses previous results and minimizes the error rate within only a few epochs.
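For concreteness, a small sketch of the classic 8-direction Freeman chain code computed from a pen trajectory, which is the kind of feature the paper pairs with a graph neural network; the quantization step below is the standard textbook form, and its pairing here is only illustrative.

```python
import math

def freeman_chain_code(points):
    """points: list of (x, y) pen positions in drawing order.
    Returns one 8-direction code (0..7) per consecutive point pair."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        codes.append(int(round(angle / (math.pi / 4))) % 8)  # quantize to 45-degree bins
    return codes

print(freeman_chain_code([(0, 0), (1, 0), (2, 1), (2, 2)]))  # -> [0, 1, 2]
```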
https://arxiv.org/abs/2405.09247
Due to the increasing need for effective security measures and the integration of cameras in commercial products, a huge amount of visual data is created today. Law enforcement agencies (LEAs) are inspecting images and videos to find radicalization, propaganda for terrorist organizations and illegal products on darknet markets. This is time consuming. Instead of an undirected search, LEAs would like to adapt to new crimes and threats, and focus only on data from specific locations, persons or objects, which requires flexible interpretation of image content. Visual concept detection with deep convolutional neural networks (CNNs) is a crucial component to understand the image content. This paper has five contributions. The first contribution allows image-based geo-localization to estimate the origin of an image. CNNs and geotagged images are used to create a model that determines the location of an image by its pixel values. The second contribution enables analysis of fine-grained concepts to distinguish sub-categories in a generic concept. The proposed method encompasses data acquisition and cleaning and concept hierarchies. The third contribution is the recognition of person attributes (e.g., glasses or moustache) to enable query by textual description for a person. The person-attribute problem is treated as a specific sub-task of concept classification. The fourth contribution is an intuitive image annotation tool based on active learning. Active learning allows users to define novel concepts flexibly and train CNNs with minimal annotation effort. The fifth contribution increases the flexibility for LEAs in the query definition by using query expansion. Query expansion maps user queries to known and detectable concepts. Therefore, no prior knowledge of the detectable concepts is required for the users. The methods are validated on data with varying locations (popular and non-touristic locations), varying person attributes (CelebA dataset), and varying number of annotations.
https://arxiv.org/abs/2405.09194
As image recognition models become more prevalent, scalable coding methods for machines and humans gain more importance. Applications of image recognition models include traffic monitoring and farm management. In these use cases, the scalable coding method proves effective because the tasks require occasional image checking by humans. Existing image compression methods for humans and machines meet these requirements to some extent. However, these compression methods are effective solely for specific image recognition models. We propose a learning-based scalable image coding method for humans and machines that is compatible with numerous image recognition models. We combine an image compression model for machines with a compression model providing additional information to facilitate image decoding for humans. The features in these compression models are fused using a feature fusion network to achieve efficient image compression. The additional-information compression model in our method is adjusted to reduce the number of parameters by enabling combinations of features of different sizes in the feature fusion network. Our experiments confirm that the feature fusion network efficiently combines image compression models while reducing the number of parameters. Furthermore, we demonstrate the effectiveness of the proposed scalable coding method by evaluating the image compression performance in terms of decoded image quality and bitrate.
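A hedged sketch of the fusion step: features from the machine-oriented codec and the additional human-decoding branch can have different spatial sizes, so one is resampled before a learned layer combines them. The layer shapes and interpolation choice are assumptions, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Toy fusion of machine-codec features with additional-information features."""
    def __init__(self, c_machine, c_extra, c_out):
        super().__init__()
        self.fuse = nn.Conv2d(c_machine + c_extra, c_out, kernel_size=3, padding=1)

    def forward(self, f_machine, f_extra):
        # resample the additional-information features to the machine features'
        # resolution so features of different sizes can be combined
        f_extra = F.interpolate(f_extra, size=f_machine.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([f_machine, f_extra], dim=1))
```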
https://arxiv.org/abs/2405.09152
Gait recognition, a rapidly advancing vision technology for person identification from a distance, has made significant strides in indoor settings. However, evidence suggests that existing methods often yield unsatisfactory results when applied to newly released real-world gait datasets. Furthermore, conclusions drawn from indoor gait datasets may not easily generalize to outdoor ones. Therefore, the primary goal of this work is to present a comprehensive benchmark study aimed at improving practicality rather than solely focusing on enhancing performance. To this end, we first develop OpenGait, a flexible and efficient gait recognition platform. Using OpenGait as a foundation, we conduct in-depth ablation experiments to revisit recent developments in gait recognition. Surprisingly, we detect imperfect parts of certain prior methods, which leads to several critical yet previously undiscovered insights. Inspired by these findings, we develop three structurally simple yet empirically powerful and practically robust baseline models, i.e., DeepGaitV2, SkeletonGait, and SkeletonGait++, respectively representing appearance-based, model-based, and multi-modal methodologies for gait pattern description. Beyond achieving SoTA performances, more importantly, our careful exploration sheds new light on the modeling experience of deep gait models, the representational capacity of typical gait modalities, and so on. We hope this work can inspire further research and application of gait recognition towards better practicality. The code is available at this https URL.
https://arxiv.org/abs/2405.09138