Self-supervised contrastive learning has emerged as one of the most successful deep learning paradigms. In this regard, it has seen extensive use in image registration and, more recently, in the particular field of medical image registration. In this work, we propose to test, extend, and improve a state-of-the-art framework for color fundus image registration, ConKeD. Using the ConKeD framework, we test multiple loss functions, adapting them to the framework and the application domain. Furthermore, we evaluate our models using the standardized benchmark dataset FIRE as well as several datasets that have never been used before for color fundus registration, for which we are releasing the pairing data as well as a standardized evaluation approach. Our work demonstrates state-of-the-art performance across all datasets and metrics, offering several advantages over current SOTA color fundus registration methods.
https://arxiv.org/abs/2404.16773
Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks to develop a deeper understanding of the tasks' underlying structure-property relationships. We identify concept explanations as dense clusters in the subgraph latent space of the self-explaining Megan models. For each concept, we optimize a representative prototype graph and optionally use GPT-4 to provide hypotheses about why each structure has a certain effect on the prediction. We conduct computational experiments on synthetic and real-world graph property prediction tasks. For the synthetic tasks, we find that our method correctly reproduces the structural rules by which they were created. For real-world molecular property regression and classification tasks, we find that our method rediscovers established rules of thumb. More specifically, our results for molecular mutagenicity prediction indicate a more fine-grained resolution of structural details than existing explainability methods, consistent with previous results from the chemistry literature. Overall, our results show promising capability to extract the underlying structure-property relationships for complex graph property prediction tasks.
https://arxiv.org/abs/2404.16532
Semi-supervised action recognition aims to improve spatio-temporal reasoning ability with a few labeled data in conjunction with a large amount of unlabeled data. Despite recent advancements, existing powerful methods are still prone to making ambiguous predictions under scarce labeled data, manifested as a limited ability to distinguish different actions with similar spatio-temporal information. In this paper, we approach this problem by equipping the model with two capabilities, namely discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples via the class prototypes of the labeled data, and adaptively selects positive-negative samples from a pseudo-labeled sample bank for contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It highlights informative semantics from long-term clips and integrates them into the short-term clip while suppressing noisy information. Both techniques are then integrated into a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51, and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.
https://arxiv.org/abs/2404.16416
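To make the ACL selection step concrete, the following is a minimal PyTorch sketch of prototype-based confidence scoring and adaptive positive-negative construction as we read the abstract; the function names, the threshold `tau`, and the temperature are illustrative choices, not the paper's.

```python
import torch
import torch.nn.functional as F

def class_prototypes(labeled_feats, labels, num_classes):
    # Mean feature per class over the labeled data, L2-normalized.
    # Assumes every class appears among the labeled samples.
    protos = torch.stack([labeled_feats[labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def select_confident(unlabeled_feats, protos, tau=0.8):
    # Confidence = highest cosine similarity to any class prototype;
    # samples above the threshold keep that prototype's pseudo-label.
    feats = F.normalize(unlabeled_feats, dim=-1)
    conf, pseudo = (feats @ protos.t()).max(dim=-1)
    keep = conf > tau
    return feats[keep], pseudo[keep]

def info_nce(anchor, positives, negatives, temp=0.1):
    # Multi-positive InfoNCE for one anchor (assumes at least one positive):
    # positives share the anchor's pseudo-label, negatives do not.
    pos_logits = anchor @ positives.t() / temp
    neg_logits = anchor @ negatives.t() / temp
    denom = torch.logsumexp(torch.cat([pos_logits, neg_logits]), dim=0)
    return (denom - pos_logits).mean()
```

In the paper the selected samples come from a maintained pseudo-labeled sample bank; here everything is drawn from a single batch for brevity.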
A unique artistic style is crucial to artists' occupational competitiveness, yet prevailing Art Commission Platforms rarely support style-based retrieval. Meanwhile, fast-growing generative AI techniques aggravate artists' concerns about releasing personal artworks to public platforms. To achieve artistic style-based retrieval without exposing personal artworks, we propose FedStyle, a style-based federated learning crowdsourcing framework. It allows artists to train local style models and share model parameters, rather than artworks, for collaboration. However, most artists possess a unique artistic style, resulting in severe model drift among them. FedStyle addresses such extreme data heterogeneity by having artists learn their abstract style representations and align with the server, rather than merely aggregating model parameters lacking semantics. In addition, we introduce contrastive learning to meticulously construct the style representation space, pulling artworks with similar styles closer and keeping different ones apart in the embedding space. Extensive experiments on the proposed datasets demonstrate the superiority of FedStyle.
https://arxiv.org/abs/2404.16336
The increasing prevalence of audio deepfakes poses significant security threats, necessitating robust detection methods. While existing detection systems exhibit promise, their robustness against malicious audio manipulations remains underexplored. To bridge the gap, we undertake the first comprehensive study of the susceptibility of the most widely adopted audio deepfake detectors to manipulation attacks. Surprisingly, even manipulations as simple as volume control can significantly bypass detection without affecting human perception. To address this, we propose CLAD (Contrastive Learning-based Audio deepfake Detector) to enhance robustness against manipulation attacks. The key idea is to incorporate contrastive learning to minimize the variations introduced by manipulations, thereby enhancing detection robustness. Additionally, we incorporate a length loss, aiming to improve detection accuracy by clustering real audio more closely in the feature space. We comprehensively evaluated the most widely adopted audio deepfake detection models and our proposed CLAD against various manipulation attacks. The detection models exhibited vulnerabilities, with the false acceptance rate (FAR) rising to 36.69%, 31.23%, and 51.28% under volume control, fading, and noise injection, respectively. CLAD enhanced robustness, reducing the FAR to 0.81% under noise injection and consistently maintaining an FAR below 1.63% across all tests. Our source code and documentation are available in the artifact repository (this https URL).
https://arxiv.org/abs/2404.15854
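As a rough illustration of CLAD's two ingredients, here is a PyTorch sketch: an InfoNCE term that treats a clip and its manipulated copy as the positive pair (so manipulation-induced variation is minimized), plus one plausible reading of the length loss as shrinking the embedding norm of real audio. The actual CLAD formulation may differ, and `encoder` is an assumed embedding network.

```python
import torch
import torch.nn.functional as F

def manipulation_invariance_loss(encoder, audio, manipulated, temp=0.07):
    # InfoNCE: each clip and its manipulated copy form the positive pair;
    # every other clip in the batch acts as a negative.
    z1 = F.normalize(encoder(audio), dim=-1)        # (B, D)
    z2 = F.normalize(encoder(manipulated), dim=-1)  # (B, D)
    logits = z1 @ z2.t() / temp                     # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def length_loss(embeddings, is_real):
    # Our reading of the abstract: penalize the embedding norm ("length")
    # of real audio so genuine samples cluster tightly; details assumed.
    real = embeddings[is_real]
    return real.norm(dim=-1).mean() if real.numel() else embeddings.new_zeros(())
```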
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, the pairwise similarity computation in the contrastive loss over image-text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training method for vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in the contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at \url{this https URL}.
https://arxiv.org/abs/2404.15653
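The computational argument is that a contrastive loss scores all $B \times B$ image-text pairs per batch, while the classification reframing scores each image once against a fixed label vocabulary. A minimal sketch of such a reframing is shown below; deriving the label set from caption nouns is one common instantiation and, together with the multi-hot targets and the class name, is our assumption rather than a detail stated in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NounClassificationPretrainer(nn.Module):
    # Sketch: replace pairwise contrastive alignment with multi-label
    # classification over a vocabulary of nouns extracted from captions.
    def __init__(self, backbone, feat_dim, vocab_size):
        super().__init__()
        self.backbone = backbone                 # any vision encoder
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, images, noun_targets):
        # noun_targets: (B, vocab_size) multi-hot labels mined from captions.
        logits = self.head(self.backbone(images))
        # Per-sample BCE: no (B, B) similarity matrix is ever formed.
        return F.binary_cross_entropy_with_logits(logits, noun_targets.float())
```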
We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC supports text and audio prompts, enabling more flexible voice style conversion. HybridVC models a latent distribution conditioned on speaker embeddings acquired by a pre-trained speaker encoder, and in parallel optimises style text embeddings to align with the speaker style information through contrastive learning. Therefore, HybridVC can be efficiently trained under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion. This underscores its potential for widespread applications such as user-defined personalised voice on various social media platforms. A comprehensive ablation study further validates the effectiveness of our method.
https://arxiv.org/abs/2404.15637
Medical Vision-Language Pretraining (Med-VLP) establishes a connection between visual content from medical images and the relevant textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must effectively capture essential semantics from the significantly sparser representations of 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we developed an abnormality dictionary to augment contrastive learning with diverse negative samples. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show our model's superior performance over the standard CLIP framework across zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
https://arxiv.org/abs/2404.15272
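A sketch of the organ-level alignment with dictionary-based negatives, assuming the organ crops and report texts are already encoded; the pairing convention (row i of the visual features matches row i of the text features) and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def organ_text_contrastive(organ_feats, text_feats, abnorm_dict_feats, temp=0.07):
    # organ_feats:       (B, D) organ-level visual features
    # text_feats:        (B, D) matched diagnostic-text features
    # abnorm_dict_feats: (M, D) text features sampled from the abnormality
    #                    dictionary, appended as extra negatives
    v = F.normalize(organ_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    n = F.normalize(abnorm_dict_feats, dim=-1)
    logits = v @ torch.cat([t, n]).t() / temp   # (B, B + M)
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)
```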
Blocking is a critical step in entity resolution, and the emergence of neural network-based representation models has led to the development of dense blocking as a promising approach for exploring deep semantics in blocking. However, previous advanced self-supervised dense blocking approaches require domain-specific training on the target domain, which limits the benefits and rapid adaptation of these methods. To address this issue, we propose UBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable tabular corpus using self-supervised contrastive learning. By conducting domain-independent pre-training, UBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning. To evaluate the universality of our entity blocker, we also construct a new benchmark covering a wide range of blocking tasks from multiple domains and scenarios. Our experiments show that the proposed UBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods and is comparable and complementary to the state-of-the-art sparse blocking methods.
https://arxiv.org/abs/2404.14831
The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmoni score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in this area. We achieve the alignment of audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics.
https://arxiv.org/abs/2404.14381
We present PLUTO, a powerful framework that pushes the limit of imitation learning-based planning for autonomous driving. Our improvements stem from three pivotal aspects: a longitudinal-lateral aware model architecture that enables flexible and diverse driving behaviors; an innovative auxiliary loss computation method that is broadly applicable and efficient for batch-wise calculation; and a novel training framework that leverages contrastive learning, augmented by a suite of new data augmentations to regulate driving behaviors and facilitate the understanding of underlying interactions. We assessed our framework using the large-scale real-world nuPlan dataset and its associated standardized planning benchmark. Impressively, PLUTO achieves state-of-the-art closed-loop performance, beating other competing learning-based methods and surpassing the current top-performing rule-based planner for the first time. Results and code are available at this https URL.
https://arxiv.org/abs/2404.14327
In this paper, we present a simple yet effective contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints. Unlike traditional knowledge distillation methods that concentrate on maximizing feature similarities or preserving class-wise semantic correlations between teacher and student features, our method attempts to recover the "dark knowledge" by aligning sample-wise teacher and student logits. Specifically, our method first minimizes logit differences within the same sample by considering their numerical values, thus preserving intra-sample similarities. Next, we bridge semantic disparities by leveraging dissimilarities across different samples. Note that constraints on intra-sample similarities and inter-sample dissimilarities can be efficiently and effectively reformulated into a contrastive learning framework with newly designed positive and negative pairs. The positive pair consists of the teacher's and student's logits derived from an identical sample, while the negative pairs are formed by using logits from different samples. With this formulation, our method benefits from the simplicity and efficiency of contrastive learning through the optimization of InfoNCE, yielding a run-time complexity that is far less than $O(n^2)$, where $n$ represents the total number of training samples. Furthermore, our method can eliminate the need for hyperparameter tuning, particularly related to temperature parameters and large batch sizes. We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO. Experimental results clearly confirm the effectiveness of the proposed method on both image classification and object detection tasks. Our source codes will be publicly available at this https URL.
https://arxiv.org/abs/2404.14109
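From this description, the objective reads as a standard InfoNCE over sample-wise logits, sketched below in PyTorch; note the paper reports removing the need for temperature tuning, so the fixed temperature and the normalization here are our simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_logits, teacher_logits, temp=0.07):
    # Positive pair: the (student, teacher) logit vectors of the same sample.
    # Negatives: teacher logits of the other samples in the batch, so the
    # cost per batch is O(B^2) rather than O(n^2) over the whole training set.
    s = F.normalize(student_logits, dim=-1)   # (B, C)
    t = F.normalize(teacher_logits, dim=-1)   # (B, C)
    sims = s @ t.t() / temp                   # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(sims, targets)     # InfoNCE with in-batch negatives
```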
Recent advances in generative visual models and neural radiance fields have greatly boosted 3D-aware image synthesis and stylization tasks. However, previous NeRF-based work is limited to single-scene stylization; training a model to generate 3D-aware cartoon faces with arbitrary styles remains unsolved. We propose ArtNeRF, a novel face stylization framework derived from a 3D-aware GAN, to tackle this problem. In this framework, we utilize an expressive generator to synthesize stylized faces and a triple-branch discriminator module to improve the visual quality and style consistency of the generated faces. Specifically, a style encoder based on contrastive learning is leveraged to extract robust low-dimensional embeddings of style images, empowering the generator with the knowledge of various styles. To smooth the training process of cross-domain transfer learning, we propose an adaptive style blending module which helps inject style information and allows users to freely tune the level of stylization. We further introduce a neural rendering module to achieve efficient real-time rendering of images at higher resolutions. Extensive experiments demonstrate that ArtNeRF is versatile in generating high-quality 3D-aware cartoon faces with arbitrary styles.
https://arxiv.org/abs/2404.13711
Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
https://arxiv.org/abs/2404.13611
Arbitrary style transfer has attracted widespread attention in research and boasts numerous practical applications. The existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, fail to generate high-quality stylized images. In this paper, we introduce an innovative technique to improve the quality of stylized images. Firstly, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. In addition, we have developed an Instance-based Contrastive Learning (ICL) approach designed to understand the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Recognizing that VGG networks are more adept at extracting classification features and less suited to capturing style features, we have also introduced the Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that our proposed method generates high-quality stylized images and effectively prevents artifacts compared with the existing state-of-the-art methods.
https://arxiv.org/abs/2404.13584
Conversational search requires accurate interpretation of user intent from complex multi-turn contexts. This paper presents ChatRetriever, which inherits the strong generalization capability of large language models to robustly represent complex conversational sessions for dense retrieval. To achieve this, we propose a simple and effective dual-learning approach that adapts LLM for retrieval via contrastive learning while enhancing the complex session understanding through masked instruction tuning on high-quality conversational instruction tuning data. Extensive experiments on five conversational search benchmarks demonstrate that ChatRetriever substantially outperforms existing conversational dense retrievers, achieving state-of-the-art performance on par with LLM-based rewriting approaches. Furthermore, ChatRetriever exhibits superior robustness in handling diverse conversational contexts. Our work highlights the potential of adapting LLMs for retrieval with complex inputs like conversational search sessions and proposes an effective approach to advance this research direction.
https://arxiv.org/abs/2404.13556
Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which necessitates a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new, challenging, and realistic task for FL research, paving the way to other image retrieval tasks in FL.
https://arxiv.org/abs/2404.13324
Despite the advancement of deep learning-based computer-aided diagnosis (CAD) methods for pneumonia from adult chest X-ray (CXR) images, the performance of CAD methods applied to pediatric images remains suboptimal, mainly due to the lack of large-scale annotated pediatric imaging datasets. Establishing a proper framework to leverage existing adult large-scale CXR datasets can thus enhance pediatric pneumonia detection performance. In this paper, we propose a three-branch parallel path learning-based framework that utilizes both adult and pediatric datasets to improve the performance of deep learning models on pediatric test datasets. The paths are trained with pediatric-only, adult-only, and both types of CXRs, respectively. Our proposed framework utilizes a multi-positive contrastive loss to cluster the class-wise embeddings, and an embedding similarity loss among these three parallel paths to bring the class-wise embeddings as close as possible, reducing the effect of domain shift. Experimental evaluations on open-access adult and pediatric CXR datasets show that the proposed method achieves a superior AUROC score of 0.8464, compared to 0.8348 obtained using the conventional approach of joint training on both datasets. The proposed approach thus paves the way for generalized CAD models that are effective for both adult and pediatric age groups.
https://arxiv.org/abs/2404.12958
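As we read the abstract, the two losses can be sketched as a SupCon-style multi-positive contrastive term over class-wise embeddings plus an embedding-similarity term pulling the three paths together; the exact formulations in the paper may differ, and the MSE choice below is an assumption.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive(z, labels, temp=0.1):
    # Every same-class embedding in the batch is a positive for the anchor.
    z = F.normalize(z, dim=-1)
    sims = z @ z.t() / temp
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sims = sims.masked_fill(self_mask, float('-inf'))   # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sims - torch.logsumexp(sims, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos, 0.0).sum(1)
    return -(pos_log_prob / pos.sum(1).clamp(min=1)).mean()

def path_similarity_loss(z_pediatric, z_adult, z_both):
    # Pull the class-wise embeddings of the three parallel paths together
    # to reduce domain shift (assumed MSE formulation).
    return (F.mse_loss(z_pediatric, z_both) + F.mse_loss(z_adult, z_both)) / 2
```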
Localizing the exact pathological regions in a given medical scan is an important imaging problem that requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to solve this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains mechanisms (cross-attention) that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any further training on target data, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches that explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance.
https://arxiv.org/abs/2404.12920
We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and of enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it before $\alpha$-blending their colors. Following this example, we train a model to also include a segmentation feature vector for each Gaussian. These can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors, and to generate 2D segmentation masks, by projecting the Gaussians onto a plane and $\alpha$-blending over their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks and still learn to generate segmentation masks consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by $+8\%$ over the state of the art. Code and trained models will be released soon.
https://arxiv.org/abs/2404.12784
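The 2D mask generation described above mirrors color compositing in Gaussian splatting, with a per-Gaussian feature vector taking the place of RGB. A minimal sketch of that $\alpha$-blending along a single ray is given below, assuming depth sorting and projection have already been done.

```python
import torch

def composite_features(features, alphas):
    # features: (K, D) segmentation features of front-to-back sorted Gaussians
    # alphas:   (K,)   their projected opacities along this ray
    # Transmittance before Gaussian k = product of (1 - alpha) of those in front.
    trans = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * trans                     # standard alpha compositing
    return (weights[:, None] * features).sum(0)  # (D,) blended feature vector
```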