User Interface (UI) understanding has become an increasingly popular topic over the last few years, but so far the focus has been almost exclusively on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. To enable research in this field, we have generated a dataset of videos in which a user performs a sequence of actions, with each frame showing the desktop contents at that point in time. We also present a framework composed of a synthetic sample generation pipeline, which augments the dataset with relevant characteristics, and a contrastive learning method for classifying the images in the videos. We take advantage of the natural conditional, tree-like relationship among the images' characteristics to regularize representation learning by handling multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses on fine-grained UI classification.
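Since the abstract does not spell out the loss, the following is a minimal PyTorch sketch of one plausible reading: a supervised contrastive (SupCon-style) term applied at each level of the tree-like label hierarchy, so that screenshots sharing an ancestor label are pulled together at that level. The function names, the per-level weighting, and the toy labels are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al., 2020) for one label level."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / temperature                        # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)  # positives: same label
    mask.fill_diagonal_(False)
    # log-softmax over all other samples (exclude self from the denominator)
    logits = sim - torch.eye(len(z), device=z.device) * 1e9
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = mask.sum(1).clamp(min=1)
    return -(log_prob * mask).sum(1).div(pos_counts).mean()

def hierarchical_supcon(z, level_labels, level_weights):
    """Sum per-level SupCon terms over a coarse-to-fine label hierarchy."""
    return sum(w * supcon_loss(z, y) for w, y in zip(level_weights, level_labels))

# z: embeddings of a batch of desktop screenshots; one label set per level
z = torch.randn(8, 128)
coarse = torch.randint(0, 3, (8,))   # e.g. application family (hypothetical)
fine = torch.randint(0, 10, (8,))    # e.g. specific UI state (hypothetical)
loss = hierarchical_supcon(z, [coarse, fine], [0.5, 1.0])
```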
https://arxiv.org/abs/2403.10170
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
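As a concrete illustration of the mixup component, here is a minimal sketch assuming mixup is applied jointly to images and their gaze heatmaps to stretch the scarce expert annotations; the interface and the joint blending are assumptions, and the paper's heatmap processor is not reproduced.

```python
import torch

def mixup(images, heatmaps, alpha=0.4):
    """Beta-distributed convex combination of images and their gaze heatmaps."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_imgs = lam * images + (1 - lam) * images[perm]
    mixed_maps = lam * heatmaps + (1 - lam) * heatmaps[perm]
    return mixed_imgs, mixed_maps, lam

imgs = torch.randn(4, 3, 224, 224)   # batch of chest X-rays (toy data)
maps = torch.rand(4, 1, 224, 224)    # radiologist eye-gaze heatmaps (toy data)
mixed_imgs, mixed_maps, lam = mixup(imgs, maps)
```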
https://arxiv.org/abs/2403.10153
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
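The cross-modal similarity consistency idea can be sketched as follows, under the assumption that it amounts to using intra-modal similarity distributions as soft targets for the cross-modal ones (a KL term alongside the usual contrastive loss); the temperature and the symmetric weighting are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def similarity_consistency(audio, text, tau=0.05):
    a = F.normalize(audio, dim=1)
    t = F.normalize(text, dim=1)
    cross = a @ t.T / tau      # audio-to-text similarities
    intra_a = a @ a.T / tau    # intra-audio similarities (soft supervision)
    intra_t = t @ t.T / tau
    # align the cross-modal rows/columns with the intra-modal distributions
    loss_a = F.kl_div(F.log_softmax(cross, 1), F.softmax(intra_a, 1), reduction="batchmean")
    loss_t = F.kl_div(F.log_softmax(cross.T, 1), F.softmax(intra_t, 1), reduction="batchmean")
    return 0.5 * (loss_a + loss_t)

loss = similarity_consistency(torch.randn(16, 256), torch.randn(16, 256))
```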
https://arxiv.org/abs/2403.10146
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view (i.e., post-collaboration feature) and its underlying relationship to individual views (i.e., pre-collaboration features), which were treated as an opaque procedure by most existing works. We propose a novel framework named CMiMC (Contrastive Mutual Information Maximization for Collaborative Perception) for intermediate collaboration. The core philosophy of CMiMC is to preserve discriminative information of individual views in the collaborative view by maximizing mutual information between pre- and post-collaboration features while enhancing the efficacy of collaborative views by minimizing the loss function of downstream tasks. In particular, we define multi-view mutual information (MVMI) for intermediate collaboration that evaluates correlations between collaborative views and individual views on both global and local scales. We establish CMiMNet based on multi-view contrastive learning to realize estimation and maximization of MVMI, which assists the training of a collaboration encoder for voxel-level feature fusion. We evaluate CMiMC on V2X-Sim 1.0, and it improves the SOTA average precision by 3.08% and 4.44% at 0.5 and 0.7 IoU (Intersection-over-Union) thresholds, respectively. In addition, CMiMC can reduce communication volume to 1/32 while achieving performance comparable to SOTA. Code and Appendix are released at this https URL.
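A toy sketch of the mutual-information maximization, assuming MVMI is estimated with an InfoNCE-style lower bound in which matched pre-/post-collaboration feature pairs are positives; the global/local split and the CMiMNet architecture are omitted, and all names are placeholders.

```python
import torch
import torch.nn.functional as F

def infonce_mi(pre, post, temperature=0.1):
    """Lower-bound MI estimate: matched (pre_i, post_i) pairs are positives."""
    pre = F.normalize(pre, dim=1)
    post = F.normalize(post, dim=1)
    logits = pre @ post.T / temperature
    targets = torch.arange(len(pre))
    return F.cross_entropy(logits, targets)

# individual-view and fused (collaborative) features for N agents (toy data)
pre = torch.randn(8, 128)
post = torch.randn(8, 128)
mi_loss = infonce_mi(pre, post)   # minimizing this maximizes the MI lower bound
```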
https://arxiv.org/abs/2403.10068
Lifelong person re-identification (LReID) assumes a practical scenario where the model is sequentially trained on continuously incoming datasets while alleviating catastrophic forgetting on the old datasets. However, not only the training datasets but also the gallery images are incrementally accumulated, which requires a huge amount of computation and storage space to extract features at the inference phase. In this paper, we address the above-mentioned problem by incorporating backward compatibility into LReID for the first time. We train the model on the continuously incoming datasets while maintaining its compatibility with the previously trained old models, without re-computing the features of the old gallery images. To this end, we devise a cross-model compatibility loss based on contrastive learning with respect to replay features across all the old datasets. Moreover, we develop a knowledge consolidation method based on part classification to learn representations shared across different datasets for backward compatibility. We also suggest a more practical methodology for performance evaluation, in which all the gallery and query images are considered together. Experimental results demonstrate that the proposed method achieves significantly higher backward compatibility than existing methods, making it a promising tool for more practical LReID scenarios.
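A minimal sketch of what such a cross-model compatibility loss could look like, assuming a contrastive term that pulls the new model's embeddings toward cached (replayed) old-model features of the same identity, so old gallery features never need re-extraction; the names and the identity-matching scheme are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_model_compat_loss(new_feats, replay_feats, ids, replay_ids, tau=0.1):
    n = F.normalize(new_feats, dim=1)
    r = F.normalize(replay_feats, dim=1)
    logits = n @ r.T / tau
    pos = ids.unsqueeze(1) == replay_ids.unsqueeze(0)   # same identity => positive
    log_prob = logits - torch.logsumexp(logits, 1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

new_feats = torch.randn(8, 256)        # current model's embeddings (toy data)
replay_feats = torch.randn(32, 256)    # cached old-model replay features
ids = torch.randint(0, 8, (8,))
replay_ids = torch.randint(0, 8, (32,))
loss = cross_model_compat_loss(new_feats, replay_feats, ids, replay_ids)
```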
https://arxiv.org/abs/2403.10022
Self-supervised 3D representation learning aims to learn effective representations from large-scale unlabeled point clouds. Most existing approaches adopt point discrimination as the pretext task, which assigns matched points in two distinct views as positive pairs and unmatched points as negative pairs. However, this approach often results in semantically identical points having dissimilar representations, leading to a high number of false negatives and introducing a "semantic conflict" problem. To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning. Segment grouping partitions points into semantically meaningful regions, which enhances semantic coherence and provides semantic guidance for the subsequent contrastive representation learning. Semantic-aware contrastive learning augments the semantic information extracted from segment grouping and helps to alleviate the issue of "semantic conflict". We conducted extensive experiments on multiple 3D scene understanding tasks. The results demonstrate that GroupContrast learns semantically meaningful representations and achieves promising transfer learning performance.
https://arxiv.org/abs/2403.09639
Constructing a 3D scene capable of accommodating open-ended language queries is a pivotal pursuit, particularly within the domain of robotics. Such technology facilitates robots in executing object manipulations based on human language directives. To tackle this challenge, some research efforts have been dedicated to the development of language-embedded implicit fields. However, implicit fields (e.g., NeRF) encounter limitations due to the need to process a large number of input views for reconstruction, coupled with their inherent inefficiency at inference. Thus, we present GaussianGrasper, which utilizes 3D Gaussian Splatting to explicitly represent the scene as a collection of Gaussian primitives. Our approach takes a limited set of RGB-D views and employs a tile-based splatting technique to create a feature field. In particular, we propose an Efficient Feature Distillation (EFD) module that employs contrastive learning to efficiently and accurately distill language embeddings derived from foundation models. With the reconstructed geometry of the Gaussian field, our method enables a pre-trained grasping model to generate collision-free grasp pose candidates. Furthermore, we propose a normal-guided grasp module to select the best grasp pose. Through comprehensive real-world experiments, we demonstrate that GaussianGrasper enables robots to accurately query and grasp objects with language instructions, providing a new solution for language-guided manipulation tasks. Data and code are available at this https URL.
https://arxiv.org/abs/2403.09637
Contrastive pretraining is well-known to improve downstream task performance and model generalisation, especially in limited label settings. However, it is sensitive to the choice of augmentation pipeline. Positive pairs should preserve semantic information while destroying domain-specific information. Standard augmentation pipelines emulate domain-specific changes with pre-defined photometric transformations, but what if we could simulate realistic domain changes instead? In this work, we show how to utilise recent progress in counterfactual image generation to this effect. We propose CF-SimCLR, a counterfactual contrastive learning approach which leverages approximate counterfactual inference for positive pair creation. Comprehensive evaluation across five datasets, on chest radiography and mammography, demonstrates that CF-SimCLR substantially improves robustness to acquisition shift with higher downstream performance on both in- and out-of-distribution data, particularly for domains which are under-represented during training.
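A minimal sketch of the CF-SimCLR recipe under stated assumptions: the standard SimCLR NT-Xent loss, with the second view of each image replaced by an approximate domain counterfactual from an external generator rather than a photometric distortion. The `counterfactual_generator` mentioned in the comment is hypothetical.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """SimCLR's NT-Xent over a batch of (view, counterfactual-view) pairs."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    n = len(z1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(-1e9)   # exclude self-pairs from the softmax
    # positive of row i is its counterfactual view at index i +/- n
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# z1: embeddings of real images; z2: embeddings of their counterfactuals,
# e.g. z2 = encoder(counterfactual_generator(x, target_domain))  (assumed API)
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
loss = nt_xent(z1, z2)
```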
https://arxiv.org/abs/2403.09605
Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks.
https://arxiv.org/abs/2403.09502
While the introduction of contrastive learning frameworks in sentence representation learning has significantly contributed to advancements in the field, it still remains unclear whether state-of-the-art sentence embeddings can capture the fine-grained semantics of sentences, particularly when conditioned on specific perspectives. In this paper, we introduce Hyper-CL, an efficient methodology that integrates hypernetworks with contrastive learning to compute conditioned sentence representations. In our proposed approach, the hypernetwork is responsible for transforming pre-computed condition embeddings into corresponding projection layers. This enables the same sentence embeddings to be projected differently according to various conditions. Evaluation on two representative conditioning benchmarks, namely conditional semantic text similarity and knowledge graph completion, demonstrates that Hyper-CL is effective in flexibly conditioning sentence representations, showcasing its computational efficiency at the same time. We also provide a comprehensive analysis of the inner workings of our approach, leading to a better interpretation of its mechanisms.
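The hypernetwork mechanism can be sketched as below, assuming the hypernetwork maps a pre-computed condition embedding to the weights of a per-condition projection layer applied to a shared sentence embedding; the dimensions and the full-matrix (rather than low-rank) parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HyperProjector(nn.Module):
    def __init__(self, cond_dim=768, sent_dim=768, proj_dim=128):
        super().__init__()
        self.sent_dim, self.proj_dim = sent_dim, proj_dim
        # hypernetwork: condition embedding -> flattened projection matrix
        self.hyper = nn.Linear(cond_dim, sent_dim * proj_dim)

    def forward(self, sent_emb, cond_emb):
        W = self.hyper(cond_emb).view(-1, self.proj_dim, self.sent_dim)
        # project the same sentence embedding differently per condition
        return torch.bmm(W, sent_emb.unsqueeze(-1)).squeeze(-1)

model = HyperProjector()
sents = torch.randn(4, 768)        # sentence embeddings (toy data)
conds = torch.randn(4, 768)        # condition ("perspective") embeddings
projected = model(sents, conds)    # (4, 128), ready for a contrastive loss
```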
https://arxiv.org/abs/2403.09490
Classical object detectors are incapable of detecting novel-class objects that were never encountered during training. To address this issue, Open-Vocabulary Object Detection (OVOD) has been proposed, which aims to detect objects from a candidate class list. However, current OVOD models suffer from overfitting to the base classes, heavy reliance on large-scale extra data, and a complex training process. To overcome these issues, we propose a novel framework with Meta prompt and Instance Contrastive learning (MIC) schemes. First, we simulate a novel-class-emerging scenario to help the prompt learner, which learns class and background prompts, generalize to novel classes. Second, we design an instance-level contrastive strategy to promote intra-class compactness and inter-class separation, which benefits the detector's generalization to novel-class objects. Without using knowledge distillation, model ensembling, or extra training data during detector training, our proposed MIC outperforms previous SOTA methods trained with these complex techniques on LVIS. Most importantly, MIC shows great generalization ability on novel classes, e.g., with $+4.3\%$ and $+1.9\% \ \mathrm{AP}$ improvement over the previous SOTA on COCO and Objects365, respectively.
https://arxiv.org/abs/2403.09433
Identifying highlight moments in raw video material is crucial for improving the efficiency of editing the videos that pervade internet platforms. However, the extensive work of manually labeling footage has created obstacles to applying supervised methods to videos of unseen categories. The absence, in many videos, of an audio modality containing valuable cues for highlight detection also makes it difficult to use multimodal strategies. In this paper, we propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio-level semantics from image-audio pair data via a self-reconstruction task. To achieve unsupervised highlight detection, we investigate the latent representations of the network and propose the representation activation sequence learning (RASL) module with k-point contrastive learning to learn significant representation activations. To connect the visual modality with the audio modality, we use the symmetric contrastive learning (SCL) module to learn paired visual and audio representations. Furthermore, an auxiliary task of masked feature vector sequence (FVS) reconstruction is conducted simultaneously during pretraining to enhance the representations. During inference, the cross-modal pretrained model can generate representations with paired visual-audio semantics given only the visual modality. The RASL module is then used to output the highlight scores. The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
https://arxiv.org/abs/2403.09401
Learning medical visual representations through vision-language pre-training has reached remarkable progress. Despite the promising performance, it still faces challenges, i.e., local alignment lacks interpretability and clinical relevance, and the insufficient internal and external representation learning of image-report pairs. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets <anatomical region, finding, existence>, and fully utilize each element as supervision to enhance representation learning. For anatomical region, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, considering them as the minimum semantic units to explore fine-grained local alignment. For finding and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks, including five public benchmarks. Experimental results demonstrate that our method outperforms the state-of-the-art methods.
https://arxiv.org/abs/2403.09294
Human-human communication is like a delicate dance in which listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work considers the listener as a reactive agent with reflexive behaviors in response to the speaker's voice and facial motions. The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers' and listeners' motions through masking and contrastive learning to learn representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations through a VQ-VAE. The pre-trained model is further fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state of the art on quantitative measures capturing the diversity and realism of generated motions. Qualitative results demonstrate the superior capability of the proposed approach in generating diverse and realistic expressions, eye blinks, and head gestures.
https://arxiv.org/abs/2403.09069
Autonomous driving stands as a pivotal domain in computer vision, shaping the future of transportation. Within this paradigm, the backbone of the system plays a crucial role in interpreting the complex environment. However, a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View elements. To address this limitation, we introduce CLIP-BEVFormer, a novel approach that leverages the power of contrastive learning techniques to enhance the multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive 8.5\% and 9.2\% enhancement in terms of NDS and mAP, respectively, over the previous best BEV model on the 3D object detection task.
https://arxiv.org/abs/2403.08919
Submodular functions, crucial for various applications, often lack practical learning methods for their acquisition. Seemingly unrelated, learning a scaling from oracles offering graded pairwise comparisons (GPC) is underexplored, despite a rich history in psychometrics. In this paper, we introduce deep submodular peripteral networks (DSPNs), a novel parametric family of submodular functions, and methods for their training using a contrastive-learning-inspired, GPC-ready strategy to connect and then tackle both of the above challenges. We introduce a newly devised GPC-style "peripteral" loss which leverages numerically graded relationships between pairs of objects (sets, in our case). Unlike traditional contrastive learning, our method utilizes graded comparisons, extracting more nuanced information than binary-outcome comparisons alone, and contrasts sets of any size (not just two). We also define a novel suite of automatic sampling strategies for training, including active-learning-inspired submodular feedback. We demonstrate DSPNs' efficacy in learning submodularity from a costly target submodular function, showing superiority in downstream tasks such as experimental design and streaming applications.
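A loose illustration of learning from graded pairwise comparisons, assuming the oracle returns a grade in [-1, 1] for how strongly one set is preferred over another and the learned set function's value gap is regressed toward it; the Deep Sets surrogate below is a stand-in, not the DSPN architecture or the actual peripteral loss.

```python
import torch
import torch.nn as nn

class DeepSetFunction(nn.Module):
    """Permutation-invariant set scorer (a simple Deep Sets surrogate)."""
    def __init__(self, dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.rho = nn.Linear(64, 1)

    def forward(self, x):                 # x: (set_size, dim)
        return self.rho(self.phi(x).sum(0)).squeeze()

f = DeepSetFunction()
A, B = torch.randn(5, 32), torch.randn(7, 32)   # two sets of any size
grade = torch.tensor(0.6)                 # oracle: A moderately preferred to B
gap = torch.tanh(f(A) - f(B))             # squash the value gap to [-1, 1]
loss = (gap - grade) ** 2                 # graded, not binary, supervision
```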
https://arxiv.org/abs/2403.08199
Recent advancements in biology and chemistry have leveraged multi-modal learning, integrating molecules and their natural language descriptions to enhance drug discovery. However, current pre-training frameworks are limited to two modalities, and designing a unified network to process different modalities (e.g., natural language, 2D molecular graphs, 3D molecular conformations, and 3D proteins) remains challenging due to inherent gaps among them. In this work, we propose MolBind, a framework that trains encoders for multiple modalities through contrastive learning, mapping all modalities to a shared feature space for multi-modal semantic alignment. To facilitate effective pre-training of MolBind on multiple modalities, we also build and collect a high-quality dataset with four modalities, MolBind-M4, including graph-language, conformation-language, graph-conformation, and conformation-protein paired data. MolBind shows superior zero-shot learning performance across a wide range of tasks, demonstrating its strong capability of capturing the underlying semantics of multiple modalities.
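A minimal sketch of binding several modality encoders into one space, assuming pairwise InfoNCE terms over whichever modality pairs have paired data (in the spirit of the four MolBind-M4 pair types); the encoders are replaced here by random placeholder embeddings.

```python
import torch
import torch.nn.functional as F

def infonce(x, y, tau=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    x, y = F.normalize(x, dim=1), F.normalize(y, dim=1)
    logits = x @ y.T / tau
    t = torch.arange(len(x))
    return 0.5 * (F.cross_entropy(logits, t) + F.cross_entropy(logits.T, t))

# outputs of per-modality encoders, all mapped to a shared 256-d space
graph_z = torch.randn(8, 256)   # 2D molecular graph encoder (placeholder)
text_z = torch.randn(8, 256)    # language encoder (placeholder)
conf_z = torch.randn(8, 256)    # 3D conformation encoder (placeholder)
loss = infonce(graph_z, text_z) + infonce(conf_z, text_z) + infonce(graph_z, conf_z)
```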
https://arxiv.org/abs/2403.08167
Self-supervised learning (SSL) is one strategy for addressing the paucity of labelled data in medical imaging by learning representations from unlabelled images. Contrastive and non-contrastive SSL methods produce learned representations that are similar for pairs of related images. Such pairs are commonly constructed by randomly distorting the same image twice. The videographic nature of ultrasound offers flexibility for defining the similarity relationship between pairs of images. In this study, we investigated the effect of utilizing proximal, distinct images from the same B-mode ultrasound video as pairs for SSL. Additionally, we introduced a sample weighting scheme that increases the weight of closer image pairs and demonstrated how it can be integrated into SSL objectives. Named Intra-Video Positive Pairs (IVPP), the method surpassed previous ultrasound-specific contrastive learning methods' average test accuracy on COVID-19 classification with the POCUS dataset by $\ge 1.3\%$. Detailed investigations of IVPP's hyperparameters revealed that some combinations of IVPP hyperparameters can lead to improved or worsened performance, depending on the downstream task. Guidelines for practitioners were synthesized based on the results, such as the merit of IVPP with task-specific hyperparameters, and the improved performance of contrastive methods for ultrasound compared to non-contrastive counterparts.
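A sketch of intra-video positive pairs with proximity weighting, assuming two distinct frames from the same ultrasound video form a positive pair and the pair's weight decays linearly with the temporal gap; the linear weighting function and the names are illustrative choices, not necessarily IVPP's exact scheme.

```python
import torch
import torch.nn.functional as F

def weighted_ntxent(z1, z2, frame_gaps, max_gap=30, tau=0.1):
    """NT-Xent over intra-video pairs, down-weighting temporally distant ones."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau
    targets = torch.arange(len(z1))
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    weights = 1.0 - frame_gaps.float() / max_gap   # closer pair => larger weight
    return (weights.clamp(min=0) * per_pair).mean()

z1 = torch.randn(8, 128)            # embeddings of frame t (toy data)
z2 = torch.randn(8, 128)            # embeddings of frame t + gap, same video
gaps = torch.randint(1, 30, (8,))   # temporal distance between the two frames
loss = weighted_ntxent(z1, z2, gaps)
```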
https://arxiv.org/abs/2403.07715
Personality detection aims to detect the personality traits underlying social media posts. One challenge of this task is the scarcity of ground-truth personality traits, which are collected via self-report questionnaires. Most existing methods learn post features directly by fine-tuning pre-trained language models under the supervision of limited personality labels. This leads to inferior-quality post features and consequently hurts performance. In addition, they treat personality traits as one-hot classification labels, overlooking the semantic information within them. In this paper, we propose a large language model (LLM) based text-augmentation-enhanced personality detection model, which distills the LLM's knowledge to enhance a small model for personality detection, even when the LLM fails at the task itself. Specifically, we prompt the LLM to generate post analyses (augmentations) from the semantic, sentiment, and linguistic aspects, which are critical for personality detection. By using contrastive learning to pull them together in the embedding space, the post encoder can better capture the psycho-linguistic information within the post representations, thus improving personality detection. Furthermore, we utilize the LLM to enrich the information in the personality labels to further enhance detection performance. Experimental results on the benchmark datasets demonstrate that our model outperforms the state-of-the-art methods on personality detection.
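A minimal sketch of the contrastive pulling step, assuming each post has K LLM-generated analyses (semantic, sentiment, linguistic) encoded alongside it and treated as that post's positives in an InfoNCE-style objective; the encoders and the batch layout are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def post_augmentation_contrast(post_z, aug_z, tau=0.07):
    """post_z: (B, D); aug_z: (B, K, D) -- K augmentation embeddings per post."""
    B, K, D = aug_z.shape
    p = F.normalize(post_z, dim=-1)
    a = F.normalize(aug_z, dim=-1).reshape(B * K, D)
    logits = p @ a.T / tau                    # (B, B*K)
    pos = torch.zeros(B, B * K, dtype=torch.bool)
    for i in range(B):
        pos[i, i * K:(i + 1) * K] = True      # a post's own analyses are positives
    log_prob = logits - torch.logsumexp(logits, 1, keepdim=True)
    return -(log_prob * pos).sum(1).div(K).mean()

# toy embeddings: 4 posts, 3 LLM-generated analyses each
loss = post_augmentation_contrast(torch.randn(4, 256), torch.randn(4, 3, 256))
```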
https://arxiv.org/abs/2403.07581
In the context of single domain generalisation, the objective is for models that have been exclusively trained on data from a single domain to demonstrate strong performance when confronted with various unfamiliar domains. In this paper, we introduce a novel model referred to as Contrastive Uncertainty Domain Generalisation Network (CUDGNet). The key idea is to augment the source capacity in both input and label spaces through the fictitious domain generator and jointly learn the domain invariant representation of each class through contrastive learning. Extensive experiments on two Single Source Domain Generalisation (SSDG) datasets demonstrate the effectiveness of our approach, which surpasses the state-of-the-art single-DG methods by up to $7.08\%$. Our method also provides efficient uncertainty estimation at inference time from a single forward pass through the generator subnetwork.
https://arxiv.org/abs/2403.07514