We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
https://arxiv.org/abs/2409.12156
Unified information extraction (UIE) aims to complete all information extraction tasks using a single model or framework. While previous work has primarily focused on instruction-tuning large language models (LLMs) with constructed datasets, these methods require significant computational resources and struggle to generalize to unseen tasks. To address these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization while reducing computational costs. The key challenge in RUIE is selecting the most beneficial demonstrations for LLMs to effectively handle diverse IE tasks. To achieve this, we integrate LLM preferences for ranking candidate demonstrations and design a keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in generalizing to unseen tasks, with average F1-score improvements of 19.22 and 3.13 compared to instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the importance of its key components.
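To make the retrieval-training recipe concrete, here is a minimal sketch (not the authors' code) of a bi-encoder demonstration retriever trained with an in-batch contrastive loss plus knowledge distillation from a reward model's scores; the batch layout, temperature, and loss weighting are assumptions for the example.

```python
# Illustrative sketch of training a bi-encoder demonstration retriever with
# contrastive learning and knowledge distillation from a reward model.
import torch
import torch.nn.functional as F

def retriever_loss(q_emb, d_emb, teacher_scores, tau=0.05, alpha=0.5):
    """q_emb: (B, H) query embeddings; d_emb: (B, H) embeddings of each query's
    paired demonstration; teacher_scores: (B, B) reward-model scores for every
    query/candidate pair in the batch."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.t() / tau                      # (B, B) similarity matrix
    labels = torch.arange(q.size(0))              # positives on the diagonal
    contrastive = F.cross_entropy(logits, labels)
    # Distil the reward model's soft ranking into the retriever's distribution.
    kd = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(teacher_scores / tau, dim=-1),
                  reduction="batchmean")
    return contrastive + alpha * kd

# Usage with random tensors standing in for encoder outputs:
loss = retriever_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 8))
```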
https://arxiv.org/abs/2409.11673
Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.
https://arxiv.org/abs/2409.11624
The Forward-Forward (FF) algorithm is a recent, purely forward-mode learning method that updates weights locally and layer-wise and supports both supervised and unsupervised learning. These features make it ideal for applications such as brain-inspired learning, low-power hardware neural networks, and distributed learning in large models. However, while FF has shown promise on handwritten digit recognition tasks, its performance on natural images and time-series remains a challenge. A key limitation is the need to generate high-quality negative examples for contrastive learning, especially in unsupervised tasks, where versatile solutions are currently lacking. To address this, we introduce the Self-Contrastive Forward-Forward (SCFF) method, inspired by self-supervised contrastive learning. SCFF generates positive and negative examples applicable across different datasets, surpassing existing local forward algorithms in unsupervised classification accuracy on MNIST (MLP: 98.7%), CIFAR-10 (CNN: 80.75%), and STL-10 (CNN: 77.3%). Additionally, SCFF is the first to enable FF training of recurrent neural networks, opening the door to more complex tasks and continuous-time video and text processing.
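As a rough illustration of the layer-local objective involved, the sketch below implements a Forward-Forward "goodness" loss for a single layer; the particular positive/negative construction (a sample concatenated with itself vs. with a different sample) is an assumption made for illustration, not SCFF's exact recipe.

```python
# Layer-local Forward-Forward sketch: goodness is pushed above a threshold for
# positive inputs and below it for negative inputs, with no backprop across layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.threshold = threshold

    def goodness(self, x):
        return self.fc(x).relu().pow(2).mean(dim=-1)   # per-sample goodness

    def local_loss(self, x_pos, x_neg):
        g_pos, g_neg = self.goodness(x_pos), self.goodness(x_neg)
        # Hinton-style FF loss: positive goodness high, negative goodness low.
        return (F.softplus(self.threshold - g_pos)
                + F.softplus(g_neg - self.threshold)).mean()

x = torch.randn(16, 784)                                 # a batch of flattened images
x_pos = torch.cat([x, x], dim=-1)                        # sample paired with itself
x_neg = torch.cat([x, x[torch.randperm(16)]], dim=-1)    # paired with another sample
layer = FFLayer(2 * 784, 512)
loss = layer.local_loss(x_pos, x_neg)                    # optimized locally per layer
```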
https://arxiv.org/abs/2409.11593
Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, it must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open-vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio, using contrastive learning. ELSA is competitive with the state of the art for both semantic retrieval and 3D source localization. In particular, ELSA achieves a mean audio-to-text and text-to-audio R@1 that is 2.8% above the baseline, and reduces the mean absolute error in 3D source localization by 11.6° relative to the baseline.
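The multimodal contrastive objective underlying models of this kind can be sketched as a symmetric InfoNCE loss between audio and caption embeddings; the encoders are abstracted away and the temperature is an assumed value, so this is illustrative rather than ELSA's actual implementation.

```python
# CLAP-style symmetric contrastive objective between audio and text embeddings.
import torch
import torch.nn.functional as F

def audio_text_contrastive(audio_emb, text_emb, tau=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.t() / tau
    labels = torch.arange(a.size(0))
    # Symmetric loss: audio-to-text and text-to-audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = audio_text_contrastive(torch.randn(32, 512), torch.randn(32, 512))
```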
https://arxiv.org/abs/2409.11369
Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit a high overlap between the cosine-similarity distributions of paired and unpaired examples in the image space, which hurts the performance of training-free few-shot classification methods that rely on image-space similarity for their predictions. To tackle intra-modal overlap, we propose training a lightweight adapter on a generic set of samples from the Google Open Images dataset, and demonstrate that this improves accuracy for training-free few-shot classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift, and c) higher feature variance, rendering the features more discriminative for downstream tasks.
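A hypothetical sketch of the analysis and the adapter idea: measure how much the cosine-similarity distributions of same-class and different-class image embeddings overlap, and attach a small residual adapter on top of the frozen features; the adapter architecture and dimensions are assumptions, not the paper's.

```python
# Compare paired vs. unpaired cosine-similarity statistics and define a
# lightweight residual adapter over frozen CLIP image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

def similarity_stats(emb, labels):
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    paired = sim[same & off_diag]          # similarities between same-class images
    unpaired = sim[~same]                  # similarities between different-class images
    return paired.mean().item(), unpaired.mean().item()

class Adapter(nn.Module):
    """Small residual MLP applied on top of frozen image features."""
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
    def forward(self, x):
        return F.normalize(x + self.net(x), dim=-1)

emb, labels = torch.randn(64, 512), torch.randint(0, 10, (64,))
print(similarity_stats(emb, labels))
adapted = Adapter()(emb)                   # features after the adapter
```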
https://arxiv.org/abs/2409.11338
Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards specific aspect terms in a sentence and allows us to uncover nuanced perspectives and attitudes on particular aspects of a product, service, or topic. However, the scarcity of labeled data poses a significant challenge to training high-quality models. To address this issue, we explore the potential of data augmentation using ChatGPT, a well-performing large language model (LLM), to enhance sentiment classification performance towards aspect terms. Specifically, we explore three ChatGPT-based data augmentation strategies: context-focused, aspect-focused, and context-aspect data augmentation. Context-focused data augmentation changes the wording of the context words in the sentence while keeping the aspect terms unchanged. In contrast, aspect-focused data augmentation changes the aspect terms but keeps the context words unchanged. Context-aspect data augmentation integrates the two to generate augmented samples. Furthermore, we incorporate contrastive learning into the ABSA task to improve performance. Extensive experiments show that all three data augmentation techniques lead to performance improvements, with the context-aspect strategy performing best and surpassing the performance of the baseline models.
https://arxiv.org/abs/2409.11218
Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning, which suffers from the false-negative problem, or on reconstruction, which learns too many inessential low-level cues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task for modeling the general underlying data distribution. However, the representation learning capacity of generative models is under-explored, especially for skeletons, with their spatial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to the encoder inputs to introduce an information bottleneck and remove the redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective encompasses the contrastive learning objective of aligning the masked and noisy views. Meanwhile, it also forces the representation to complement the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining competence for generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing fine-tuning performance in scenarios with scarce labeled data. Our project is available at this https URL.
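A schematic of a masked-conditional-diffusion training step consistent with this description, with stand-in MLPs for the encoder and decoder and a simplified noise schedule; it is meant to show the conditioning flow, not reproduce MacDiff.

```python
# Masked skeleton -> semantic encoder -> condition a denoising decoder on the
# full (noised) sequence; trained with a standard noise-prediction objective.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(75, 256), nn.ReLU(), nn.Linear(256, 128))   # per-frame encoder
decoder = nn.Sequential(nn.Linear(75 + 128 + 1, 256), nn.ReLU(), nn.Linear(256, 75))

def macdiff_step(x, mask_ratio=0.75):
    """x: (B, T, 75) skeleton sequences (e.g. 25 joints x 3 coordinates)."""
    B, T, D = x.shape
    keep = (torch.rand(B, T, 1) > mask_ratio).float()
    z = encoder(x * keep).mean(dim=1)                   # condition from the masked input
    t = torch.rand(B, 1)                                # diffusion timestep in [0, 1]
    noise = torch.randn_like(x)
    x_noisy = torch.sqrt(1 - t).unsqueeze(-1) * x + torch.sqrt(t).unsqueeze(-1) * noise
    cond = torch.cat([x_noisy, z.unsqueeze(1).expand(B, T, -1),
                      t.unsqueeze(-1).expand(B, T, 1)], dim=-1)
    pred_noise = decoder(cond)
    return ((pred_noise - noise) ** 2).mean()           # denoising objective

loss = macdiff_step(torch.randn(4, 64, 75))
```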
https://arxiv.org/abs/2409.10473
Semi-supervised medical image segmentation has shown promise in training models with limited labeled data and abundant unlabeled data. However, state-of-the-art methods ignore a potentially valuable source of unsupervised semantic information -- spatial registration transforms between image volumes. To address this, we propose CCT-R, a contrastive cross-teaching framework incorporating registration information. To leverage the semantic information available in registrations between volume pairs, CCT-R incorporates two proposed modules: Registration Supervision Loss (RSL) and Registration-Enhanced Positive Sampling (REPS). The RSL leverages segmentation knowledge derived from transforms between labeled and unlabeled volume pairs, providing an additional source of pseudo-labels. REPS enhances contrastive learning by identifying anatomically-corresponding positives across volumes using registration transforms. Experimental results on two challenging medical segmentation benchmarks demonstrate the effectiveness and superiority of CCT-R across various semi-supervised settings, with as few as one labeled case. Our code is available at this https URL.
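The following sketch illustrates how registration-derived correspondences could drive positive sampling for a dense contrastive loss; the correspondences are passed in as precomputed index pairs, which is a simplification of deriving them from the registration transform, and the shapes are assumptions.

```python
# Registration-enhanced positive sampling, simplified: per-location features of
# two registered volumes are contrasted, with anatomically corresponding
# locations (given by the registration) serving as positives.
import torch
import torch.nn.functional as F

def reps_contrastive(feat_a, feat_b, corr, tau=0.1):
    """feat_a, feat_b: (N, C) per-location features from two volumes;
    corr: (M, 2) indices such that feat_a[corr[:, 0]] corresponds to feat_b[corr[:, 1]]."""
    a = F.normalize(feat_a[corr[:, 0]], dim=-1)        # anchors
    b = F.normalize(feat_b, dim=-1)                    # all candidate locations
    logits = a @ b.t() / tau                           # (M, N)
    return F.cross_entropy(logits, corr[:, 1])         # positive = registered location

feat_a, feat_b = torch.randn(1024, 64), torch.randn(1024, 64)
corr = torch.stack([torch.arange(256), torch.randperm(1024)[:256]], dim=1)
loss = reps_contrastive(feat_a, feat_b, corr)
```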
https://arxiv.org/abs/2409.10422
Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging, such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Evaluated across five datasets encompassing both chest radiography and mammography data, and for two established contrastive objectives (SimCLR and DINO-v2), our method outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance both in-distribution and on external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning substantially improving subgroup performance across biological sex.
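A sketch of the counterfactual positive-pair idea: one view of each image is a counterfactual generated under a different acquisition domain, and a standard SimCLR-style NT-Xent loss is applied. Here `counterfactual_model` and `encoder` are placeholders for pretrained networks, and the binary-domain assumption is only for the example.

```python
# SimCLR-style NT-Xent loss where the second view is a counterfactual of the
# first (e.g. "the same scan as if acquired on a different scanner").
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    z = F.normalize(torch.cat([z1, z2]), dim=-1)       # (2B, H)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                  # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

def training_step(images, domains, encoder, counterfactual_model):
    # view 1: the real image; view 2: its counterfactual under another domain
    cf = counterfactual_model(images, target_domain=1 - domains)
    return nt_xent(encoder(images), encoder(cf))

# Dummy stand-ins to show the call pattern:
enc = lambda x: x.flatten(1)
cf_model = lambda x, target_domain: x + 0.1 * torch.randn_like(x)
loss = training_step(torch.randn(8, 3, 64, 64), torch.randint(0, 2, (8,)), enc, cf_model)
```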
https://arxiv.org/abs/2409.10365
This thesis investigates the effectiveness of SimCLR, a contrastive learning technique, in Greek letter recognition, focusing on the impact of various augmentation techniques. We pretrain the SimCLR backbone using the Alpub dataset (pretraining dataset) and fine-tune it on a smaller ICDAR dataset (finetuning dataset) to compare SimCLR's performance against traditional baseline models, which use cross-entropy and triplet loss functions. Additionally, we explore the role of different data augmentation strategies, essential for the SimCLR training process. Methodologically, we examine three primary approaches: (1) a baseline model using cross-entropy loss, (2) a triplet embedding model with a classification layer, and (3) a SimCLR pretrained model with a classification layer. Initially, we train the baseline, triplet, and SimCLR models using 93 augmentations on ResNet-18 and ResNet-50 networks with the ICDAR dataset. From these, the top four augmentations are selected using a statistical t-test. Pretraining of SimCLR is conducted on the Alpub dataset, followed by fine-tuning on the ICDAR dataset. The triplet loss model undergoes a similar process, being pretrained on the top four augmentations before fine-tuning on ICDAR. Our experiments show that SimCLR does not outperform the baselines in letter recognition tasks. The baseline model with cross-entropy loss demonstrates better performance than both SimCLR and the triplet loss model. This study provides a detailed evaluation of contrastive learning for letter recognition, highlighting SimCLR's limitations while emphasizing the strengths of traditional supervised learning models in this task. We believe SimCLR's cropping strategies may cause a semantic shift in the input image, reducing training effectiveness despite the large pretraining dataset. Our code is available at this https URL.
https://arxiv.org/abs/2409.10156
As a form of biometric authentication technology, the security of speaker verification (SV) systems is of utmost importance. However, SV systems are inherently vulnerable to various types of attacks that can compromise their accuracy and reliability. One such attack is voice conversion, which modifies a person's speech to sound like another person by altering various vocal characteristics. This poses a significant threat to SV systems. To address this challenge, the Source Speaker Tracing Challenge (SSTC) at IEEE SLT 2024 aims to identify the source speaker information in manipulated speech signals. Specifically, SSTC focuses on source speaker verification against voice conversion to determine whether two converted speech samples originate from the same source speaker. In this study, we propose a speaker-contrastive-learning-based approach for source speaker tracing to learn the latent source speaker information in converted speech. To learn a more source-speaker-related representation, we employ a speaker contrastive loss during the training of the embedding extractor. This speaker contrastive loss helps identify the true source speaker embedding among several distractor speaker embeddings, enabling the embedding extractor to learn the source speaker information potentially present in the converted speech. Experiments demonstrate that our proposed speaker contrastive learning system achieves the lowest EER of 16.788% on the challenge test set, securing first place in the challenge.
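A speaker contrastive loss of the kind described can be written, in simplified form, as a cross-entropy over similarities between a converted-speech embedding and a candidate set containing the true source speaker plus distractors; dimensions and the temperature are assumptions for this sketch.

```python
# Converted-utterance embedding should be most similar to the true source
# speaker's embedding among distractor speaker embeddings.
import torch
import torch.nn.functional as F

def speaker_contrastive_loss(utt_emb, spk_embs, source_idx, tau=0.05):
    """utt_emb: (B, H) embeddings of converted speech;
    spk_embs: (B, K, H) one true source speaker plus K-1 distractors per utterance;
    source_idx: (B,) index of the true source speaker within each candidate set."""
    u = F.normalize(utt_emb, dim=-1).unsqueeze(1)        # (B, 1, H)
    s = F.normalize(spk_embs, dim=-1)                    # (B, K, H)
    logits = (u * s).sum(dim=-1) / tau                   # (B, K) cosine similarities
    return F.cross_entropy(logits, source_idx)

loss = speaker_contrastive_loss(torch.randn(8, 192), torch.randn(8, 5, 192),
                                torch.randint(0, 5, (8,)))
```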
https://arxiv.org/abs/2409.10072
Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy analogies in embedding space via vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that differences in image embedding space correspond to text descriptions of the image differences, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or in constructing attribute-based classifiers, as well as improved zero-shot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between the classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.
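One way to picture the training objective is a contrastive loss that aligns image-embedding differences with embeddings of difference descriptions, sketched below with precomputed embeddings; the paper's exact loss and data pipeline may differ.

```python
# Align the difference between two image embeddings with the text embedding of
# a description of that difference, via a symmetric contrastive loss.
import torch
import torch.nn.functional as F

def difference_contrastive(img_a, img_b, diff_text, tau=0.07):
    """img_a, img_b: (B, H) image embeddings of paired images;
    diff_text: (B, H) text embeddings describing how image A differs from image B."""
    d = F.normalize(img_a - img_b, dim=-1)             # direction of change in image space
    t = F.normalize(diff_text, dim=-1)
    logits = d @ t.t() / tau
    labels = torch.arange(d.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = difference_contrastive(torch.randn(16, 512), torch.randn(16, 512), torch.randn(16, 512))
```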
https://arxiv.org/abs/2409.09721
We present a contrastive learning framework based on in-the-wild hand images, tailored for pre-training 3D hand pose estimators, dubbed HandCLR. Pre-training on large-scale images achieves promising results in various tasks, but prior 3D hand pose pre-training methods have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our method around contrastive learning. Specifically, we collected over 2.0M hand images from recent human-centric videos such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of similar hand poses originating from different samples, and propose a novel contrastive learning method that embeds similar hand pairs closer in the latent space. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method on various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands.
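A hedged sketch of cross-sample positive mining of the kind described: positives are crops from different samples with the most similar poses, followed by an InfoNCE loss; the pose descriptors and the mining rule are simplifications, not the paper's implementation.

```python
# Mine positives across samples by pose similarity, then apply an InfoNCE loss
# that pulls the mined pairs together in embedding space.
import torch
import torch.nn.functional as F

def mine_positives(pose_desc, sample_ids):
    """pose_desc: (N, P) pose descriptors; sample_ids: (N,) video/sample ids.
    Returns, for each item, the index of the most similar pose from another sample."""
    p = F.normalize(pose_desc, dim=-1)
    sim = p @ p.t()
    same_sample = sample_ids.unsqueeze(0) == sample_ids.unsqueeze(1)
    sim = sim.masked_fill(same_sample, float("-inf"))   # forbid same-sample positives
    return sim.argmax(dim=1)

def handclr_loss(emb, pos_idx, tau=0.1):
    z = F.normalize(emb, dim=-1)
    logits = z @ z.t() / tau
    logits.fill_diagonal_(float("-inf"))
    return F.cross_entropy(logits, pos_idx)

emb, pose, ids = torch.randn(64, 256), torch.randn(64, 42), torch.randint(0, 8, (64,))
loss = handclr_loss(emb, mine_positives(pose, ids))
```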
https://arxiv.org/abs/2409.09714
Labelled data are limited, and self-supervised learning is one of the most important approaches for reducing labelling requirements. While it has been extensively explored in the image domain, it has so far not received the same amount of attention in the acoustic domain. Yet, reducing labelling is a key requirement for many acoustic applications. Specifically, in bioacoustics, sufficient labels for fully supervised learning are rarely available. This has led to the widespread use of acoustic recognisers that have been pre-trained on unrelated data for bioacoustic tasks. We posit that training on the actual task data and combining self-supervised pre-training with few-shot classification is a superior approach that can deliver high accuracy even when only a few labels are available. To this end, we introduce and evaluate a new architecture that combines CNN-based preprocessing with feature extraction based on state space models (SSMs). This combination is motivated by the fact that CNN-based networks alone struggle to capture temporal information effectively, which is crucial for classifying acoustic signals. SSMs, specifically S4 and Mamba, on the other hand, have been shown to have an excellent ability to capture long-range dependencies in sequence data. We pre-train this architecture using contrastive learning on the actual task data and subsequently fine-tune it with an extremely small amount of labelled data. We evaluate the performance of the proposed architecture for (n-shot, n-class) classification on standard benchmarks as well as real-world data. Our evaluation shows that it outperforms state-of-the-art architectures on the few-shot classification problem.
https://arxiv.org/abs/2409.09647
Traditional test-time training (TTT) methods, while addressing domain shifts, often assume a consistent class set, limiting their applicability in real-world scenarios characterized by infinite variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (OOD) data. Existing TTT methods often struggle to maintain performance when confronted with strong OOD data. In OWTTT, the focus has predominantly been on distinguishing between overall strong and weak OOD data. However, during the early stages of TTT, initial feature extraction is hampered by interference from strong OOD data and corruptions, resulting in diminished contrast and premature classification of certain classes as strong OOD. To address this, we introduce Open World Dynamic Contrastive Learning (OWDCL), an innovative approach that utilizes contrastive learning to augment positive sample pairs. This strategy not only bolsters contrast in the early stages but also significantly enhances model robustness in subsequent stages. On standard comparison datasets, our OWDCL model achieves state-of-the-art performance.
https://arxiv.org/abs/2409.09591
The success of Vision Language Models (VLMs) on various vision-language tasks heavily relies on pre-training with large-scale web-crawled datasets. However, the noisy and incomplete nature of web data makes dataset scale crucial for performance, rendering end-to-end training increasingly prohibitive. In this paper, we propose NEVLP, a noise-robust framework for efficient vision-language pre-training that requires less pre-training data. Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer and introduce two innovative learning strategies, noise-adaptive learning and concept-enhanced learning, to mitigate the impact of noise. In noise-adaptive learning, we estimate the noise probability of each image-text pair based on the transformer's memorization effect and employ noise-adaptive regularization on image-text contrastive learning to condition cross-modal alignment. In concept-enhanced learning, we enrich incomplete text by incorporating visual concepts (objects in the image) to provide prior information about existing objects for image-text matching and image-grounded text generation, thereby mitigating text incompleteness. Our framework effectively utilizes noisy web data and achieves state-of-the-art performance with less pre-training data across a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering.
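As an illustration of noise-adaptive weighting on the image-text contrastive term, the sketch below down-weights each pair by an estimated noise probability; how that probability is obtained (from the transformer's memorization effect) is left outside the sketch and the weighting scheme is an assumption.

```python
# Image-text contrastive loss where each pair's contribution is scaled by an
# estimated probability that the pair is clean.
import torch
import torch.nn.functional as F

def noise_adaptive_itc(img_emb, txt_emb, noise_prob, tau=0.07):
    """img_emb, txt_emb: (B, H) paired embeddings; noise_prob: (B,) in [0, 1]."""
    v = F.normalize(img_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.t() / tau
    labels = torch.arange(v.size(0))
    per_pair = 0.5 * (F.cross_entropy(logits, labels, reduction="none")
                      + F.cross_entropy(logits.t(), labels, reduction="none"))
    weights = 1.0 - noise_prob                     # clean pairs count more
    return (weights * per_pair).sum() / weights.sum().clamp_min(1e-6)

loss = noise_adaptive_itc(torch.randn(16, 256), torch.randn(16, 256), torch.rand(16))
```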
https://arxiv.org/abs/2409.09582
Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks with pre-trained audio models and text models. However, these models are pre-trained independently and usually on tasks different from the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio also requires highly specialized skills, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signal input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective and a language-audio matching objective to align the audio and the ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models in all metrics, showing great promise for domain-specific IVA applications.
https://arxiv.org/abs/2409.09289
Acoustic identification of individual animals (AIID) is closely related to audio-based species classification but requires a finer level of detail to distinguish between individual animals within the same species. In this work, we frame AIID as a hierarchical multi-label classification task and propose the use of hierarchy-aware loss functions to learn robust representations of individual identities that maintain the hierarchical relationships among species and taxa. Our results demonstrate that hierarchical embeddings not only enhance identification accuracy at the individual level but also at higher taxonomic levels, effectively preserving the hierarchical structure in the learned representations. By comparing our approach with non-hierarchical models, we highlight the advantage of enforcing this structure in the embedding space. Additionally, we extend the evaluation to the classification of novel individual classes, demonstrating the potential of our method in open-set classification scenarios.
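A minimal example of a hierarchy-aware objective consistent with this setup: per-level classification heads whose cross-entropy losses are summed with assumed weights, so the learned representation must be predictive at the individual, species, and higher taxon levels. The head sizes and weights are placeholders, not the paper's configuration.

```python
# Hierarchy-aware multi-level classification loss over a shared embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalHead(nn.Module):
    def __init__(self, dim, n_individuals, n_species, n_taxa):
        super().__init__()
        self.ind = nn.Linear(dim, n_individuals)
        self.spe = nn.Linear(dim, n_species)
        self.tax = nn.Linear(dim, n_taxa)

    def loss(self, feat, y_ind, y_spe, y_tax, weights=(1.0, 0.5, 0.25)):
        return (weights[0] * F.cross_entropy(self.ind(feat), y_ind)
                + weights[1] * F.cross_entropy(self.spe(feat), y_spe)
                + weights[2] * F.cross_entropy(self.tax(feat), y_tax))

head = HierarchicalHead(256, n_individuals=120, n_species=20, n_taxa=5)
feat = torch.randn(32, 256)
loss = head.loss(feat, torch.randint(0, 120, (32,)),
                 torch.randint(0, 20, (32,)), torch.randint(0, 5, (32,)))
```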
https://arxiv.org/abs/2409.08673
Passive acoustic monitoring (PAM) is crucial for bioacoustic research, enabling non-invasive species tracking and biodiversity monitoring. Citizen-science platforms like Xeno-Canto provide large annotated datasets from focal recordings, where the target species is intentionally recorded. However, PAM requires monitoring in passive soundscapes, creating a domain shift between focal and passive recordings that challenges deep learning models trained on focal recordings. To address this, we leverage supervised contrastive learning to improve domain generalization in bird sound classification, enforcing domain invariance across same-class examples from different domains. We also propose ProtoCLR (Prototypical Contrastive Learning of Representations), which reduces the computational complexity of the SupCon loss by comparing examples to class prototypes instead of performing pairwise comparisons. Additionally, we present a new few-shot classification benchmark based on BirdSet, a large-scale bird sound dataset, and demonstrate the effectiveness of our approach in achieving strong transfer performance.
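A minimal ProtoCLR-style sketch: each embedding is contrasted against one prototype per class (here the normalized per-class mean within the batch) rather than against every other example, reducing comparisons from O(B^2) to O(B*C); details such as prototype momentum are omitted, and the batch-local prototype computation is an assumption.

```python
# Prototypical contrastive loss: contrast each embedding against per-class
# prototypes instead of performing pairwise SupCon comparisons.
import torch
import torch.nn.functional as F

def protoclr_loss(emb, labels, n_classes, tau=0.1):
    z = F.normalize(emb, dim=-1)                          # (B, H)
    onehot = F.one_hot(labels, n_classes).float()         # (B, C)
    proto = F.normalize(onehot.t() @ z, dim=-1)           # (C, H) normalized prototypes
    logits = z @ proto.t() / tau                          # (B, C)
    return F.cross_entropy(logits, labels)

loss = protoclr_loss(torch.randn(64, 128), torch.randint(0, 10, (64,)), n_classes=10)
```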
https://arxiv.org/abs/2409.08589