Foreground segmentation is a fundamental task in computer vision, encompassing various subdivision tasks. Previous research has typically designed task-specific architectures for each task, leading to a lack of unification. Moreover, these methods primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework that can handle multiple foreground tasks. We develop a multi-scale semantic network using the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method, integrating a contrastive learning strategy to refine the prediction mask in multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics.
https://arxiv.org/abs/2501.05238
As opposed to human drivers, current autonomous driving systems still require vast amounts of labeled data to train. Recently, world models have been proposed to simultaneously enhance autonomous driving capabilities by improving the way these systems understand complex real-world environments and reduce their data demands via self-supervised pre-training. In this paper, we present AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a novel self-supervised pre-training framework for autonomous driving with LiDAR data that, as opposed to existing methods, is neither generative nor contrastive. Our method learns spatial world models with a joint embedding predictive architecture. Instead of explicitly generating masked unknown regions, our self-supervised world models predict Bird's Eye View (BEV) embeddings to represent the diverse nature of autonomous driving scenes. Our approach furthermore eliminates the need to manually create positive and negative pairs, as is the case in contrastive learning. AD-L-JEPA leads to a simpler implementation and enhanced learned representations. We qualitatively and quantitatively demonstrate the high quality of the embeddings learned with AD-L-JEPA. We further evaluate the accuracy and label efficiency of AD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and associated transfer learning. Our experimental evaluation demonstrates that AD-L-JEPA is a plausible approach for self-supervised pre-training in autonomous driving applications and that it outperforms the SOTA, including the recently proposed Occupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at this https URL.
https://arxiv.org/abs/2501.04969
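For intuition, here is a minimal, heavily simplified sketch of the joint-embedding-predictive idea described above: a context encoder sees a masked BEV feature grid, a predictor regresses the embeddings that an EMA target encoder produces for the hidden regions, and nothing is ever reconstructed in pixel or point space. The tiny CNN, masking scheme, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# JEPA-style sketch: predict target BEV embeddings for masked regions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBEVEncoder(nn.Module):
    def __init__(self, in_ch=64, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):              # x: (B, C, H, W) pseudo-BEV feature grid
        return self.net(x)             # (B, dim, H, W) embeddings

context_enc = TinyBEVEncoder()
target_enc = copy.deepcopy(context_enc)
for p in target_enc.parameters():      # target branch receives no gradients
    p.requires_grad_(False)
predictor = nn.Conv2d(128, 128, 1)
opt = torch.optim.AdamW(list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-4)

def jepa_step(bev, mask, ema_m=0.996):
    """bev: (B, 64, H, W) BEV features; mask: (B, 1, H, W), 1 = hidden from the context."""
    with torch.no_grad():
        target = target_enc(bev)                         # embeddings of the full scene
    pred = predictor(context_enc(bev * (1 - mask)))      # predict them from the visible part
    loss = F.smooth_l1_loss(pred * mask, target * mask)  # regress only the masked regions
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                # EMA update of the target encoder
        for pt, pc in zip(target_enc.parameters(), context_enc.parameters()):
            pt.mul_(ema_m).add_(pc, alpha=1 - ema_m)
    return loss.item()

print(jepa_step(torch.randn(2, 64, 32, 32), (torch.rand(2, 1, 32, 32) > 0.5).float()))
```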
Recent advancements in audio deepfake detection have leveraged graph neural networks (GNNs) to model frequency and temporal interdependencies in audio data, effectively identifying deepfake artifacts. However, the reliance of GNN-based methods on substantial labeled data for graph construction and robust performance limits their applicability in scenarios with limited labeled data. Although vast amounts of audio data exist, the process of labeling samples as genuine or fake remains labor-intensive and costly. To address this challenge, we propose SIGNL (Spatio-temporal vIsion Graph Non-contrastive Learning), a novel framework that maintains high GNN performance in low-label settings. SIGNL constructs spatio-temporal graphs by representing patches from the audio's visual spectrogram as nodes. These graph structures are modeled using vision graph convolutional (GC) encoders pre-trained through graph non-contrastive learning, a label-free approach that maximizes the similarity between positive pairs. The pre-trained encoders are then fine-tuned for audio deepfake detection, reducing reliance on labeled data. Experiments demonstrate that SIGNL outperforms state-of-the-art baselines across multiple audio deepfake detection datasets, achieving the lowest Equal Error Rate (EER) with as little as 5% labeled data. Additionally, SIGNL exhibits strong cross-domain generalization, achieving the lowest EER in evaluations involving diverse attack types and languages in the In-The-Wild dataset.
https://arxiv.org/abs/2501.04942
Recent advancements in multimodal models have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, studies on visual matching ability are missing, even though finding the visual correspondence of objects is essential in vision research. Our research reveals that the matching capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even in strong current MLLMs such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we design an automatic annotation pipeline to generate the MMVM SFT dataset, which includes 220K visual matching samples with reasoning annotations. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning and an instruction augmentation strategy. CoLVA achieves 51.06% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and the baseline by 8.41% and 23.58% OA, respectively. The results show the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models are available at this https URL.
https://arxiv.org/abs/2501.04670
Artificial Intelligence is revolutionizing medical practice, enhancing diagnostic accuracy and healthcare delivery. However, its adoption in medical settings still faces significant challenges related to data availability and privacy constraints. Synthetic data has emerged as a promising solution to mitigate these issues, addressing data scarcity while preserving privacy. Recently, Latent Diffusion Models have emerged as a powerful tool for generating high-quality synthetic data. Meanwhile, the integration of different modalities has gained interest, emphasizing the need for models capable of handling multimodal medical data. Current approaches struggle to integrate complementary information and lack the ability to generate modalities simultaneously. To address this challenge, we present MedCoDi-M, a 6.77-billion-parameter model designed for multimodal medical data generation that, following the Foundation Model paradigm, exploits contrastive learning and a large quantity of data to build a shared latent space which captures the relationships between different data modalities. Further, we introduce the Multi-Prompt training technique, which significantly boosts MedCoDi-M's generation under different settings. We extensively validate MedCoDi-M: first, we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we assess the utility of MedCoDi-M in addressing key challenges in the medical field, such as anonymization, data scarcity, and imbalanced learning. The results are promising, demonstrating the applicability of MedCoDi-M in medical contexts. The project page is at this https URL.
https://arxiv.org/abs/2501.04614
Over the last decade, representation learning, which embeds complex information extracted from large amounts of data into dense vector spaces, has emerged as a key technique in machine learning. Among other applications, it has been a key building block for large language models and advanced computer vision systems based on contrastive learning. A core component of representation learning systems is the projection head, which maps the original embeddings into different, often compressed spaces, while preserving the similarity relationship between vectors. In this paper, we propose a quantum-inspired projection head that includes a corresponding quantum-inspired similarity metric. Specifically, we map classical embeddings onto quantum states in Hilbert space and introduce a quantum circuit-based projection head to reduce embedding dimensionality. To evaluate the effectiveness of this approach, we extended the BERT language model by integrating our projection head for embedding compression. We compared the performance of embeddings, which were compressed using our quantum-inspired projection head, with those compressed using a classical projection head on information retrieval tasks using the TREC 2019 and TREC 2020 Deep Learning benchmarks. The results demonstrate that our quantum-inspired method achieves competitive performance relative to the classical method while utilizing 32 times fewer parameters. Furthermore, when trained from scratch, it notably excels, particularly on smaller datasets. This work not only highlights the effectiveness of the quantum-inspired approach but also emphasizes the utility of efficient, ad hoc low-entanglement circuit simulations within neural networks as a powerful quantum-inspired technique.
https://arxiv.org/abs/2501.04591
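As a loose illustration of the quantum-inspired idea (not the paper's circuit), the sketch below treats an L2-normalised embedding as state amplitudes, applies a learnable orthogonal map as a stand-in for a parameterised circuit, compresses by marginalising out half of the notional qubits, and scores similarity with a fidelity-like squared overlap. The qubit count, the marginalisation step, and the orthogonal stand-in are all assumptions.

```python
# Quantum-inspired projection-head sketch with a fidelity-like similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantumInspiredHead(nn.Module):
    def __init__(self, n_qubits=8, keep=4):
        super().__init__()
        dim = 2 ** n_qubits
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.keep = keep

    def forward(self, x):                           # x: (B, 2**n_qubits)
        psi = F.normalize(x, dim=-1)                # amplitude-style state encoding
        q, _ = torch.linalg.qr(self.weight)         # orthogonal "circuit" stand-in
        psi = psi @ q
        psi = psi.reshape(x.size(0), 2 ** self.keep, -1)
        return psi.norm(dim=-1)                     # marginal amplitudes of kept "qubits"

def fidelity(a, b):                                 # squared overlap of two real states
    return (a * b).sum(-1) ** 2

head = QuantumInspiredHead()
z = head(torch.randn(4, 256))                       # 256-dim -> 16-dim compressed embedding
print(z.shape, fidelity(z[0], z[1]).item())
```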
ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data. In this chapter, we have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns. Our proposed multimodal scene graph consists of two components: a visual graph and a textual graph, each designed to capture the structural and semantic information within the chart. To unify representations across these different modalities, we introduce a multimodal graph contrastive learning approach that learns unified representations by maximizing similarity between nodes representing the same object across multimodal graphs. The learned graph representations can be seamlessly incorporated into a transformer decoder as a soft prompt. Additionally, given the growing need for Multimodal Large Language Models (MLLMs) in zero-shot scenarios, we have designed Chain-of-Thought (CoT) prompts for MLLMs to reduce hallucinations. We tested both methods on public benchmarks such as ChartQA, OpenCQA, and ChartX, demonstrating improved performance and validating the effectiveness of our proposed methods.
https://arxiv.org/abs/2501.04303
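The cross-graph alignment objective can be pictured as a symmetric InfoNCE between nodes of the visual and textual graphs that refer to the same chart element. The sketch below assumes the node embeddings already exist (e.g., as GNN outputs) and that row i of each matrix refers to the same object; it is not the paper's full pipeline.

```python
# Symmetric InfoNCE between matched nodes of two modality-specific graphs.
import torch
import torch.nn.functional as F

def cross_graph_nce(vis_nodes, txt_nodes, temperature=0.07):
    """vis_nodes, txt_nodes: (N, d) node embeddings; row i of each describes the
    same chart element (bar, axis label, legend entry, ...)."""
    v = F.normalize(vis_nodes, dim=-1)
    t = F.normalize(txt_nodes, dim=-1)
    logits = v @ t.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(v.size(0))
    # Match visual -> textual and textual -> visual; off-diagonal nodes are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

vis = torch.randn(16, 256, requires_grad=True)     # e.g. outputs of the visual-graph GNN
txt = torch.randn(16, 256, requires_grad=True)     # e.g. outputs of the textual-graph GNN
loss = cross_graph_nce(vis, txt)
loss.backward()
```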
Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on eleven large-scale multi-modal datasets highlight our superior performance, demonstrating its adaptability, efficiency, and robustness in real-world autonomous driving scenarios.
https://arxiv.org/abs/2501.04005
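A minimal sketch of the superpixel-to-point contrastive alignment follows, assuming a precomputed point-to-superpixel assignment from camera projection and frozen VFM superpixel features; the pooling, shapes, and names are illustrative, not the LargeAD implementation.

```python
# Pool point features per superpixel, then contrast against VFM superpixel features.
import torch
import torch.nn.functional as F

def superpixel_point_nce(point_feats, sp_ids, sp_feats, temperature=0.07):
    """point_feats: (P, d) features from the 3D backbone; sp_ids: (P,) superpixel
    index of each point after camera projection; sp_feats: (S, d) VFM features."""
    S, d = sp_feats.shape
    pooled = torch.zeros(S, d).index_add(0, sp_ids, point_feats)        # sum per superpixel
    counts = torch.bincount(sp_ids, minlength=S).clamp(min=1).unsqueeze(-1)
    z3d = F.normalize(pooled / counts, dim=-1)                           # "superpoint" embeddings
    z2d = F.normalize(sp_feats, dim=-1)
    logits = z3d @ z2d.T / temperature                                   # diagonal = matched pairs
    return F.cross_entropy(logits, torch.arange(S))

pts = torch.randn(4096, 96, requires_grad=True)    # LiDAR point features
ids = torch.randint(0, 64, (4096,))                # projected superpixel id per point
sps = torch.randn(64, 96)                          # frozen VFM superpixel features
superpixel_point_nce(pts, ids, sps).backward()
```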
Recent research has demonstrated that Large Language Models (LLMs) are not limited to text-only tasks but can also function as multimodal models across various modalities, including audio, images, and videos. In particular, research on 3D Large Multimodal Models (3D LMMs) is making notable strides, driven by the potential of processing higher-dimensional data like point clouds. However, upon closer examination, we find that the visual and textual content within each sample of existing training datasets lacks both high informational granularity and clarity, which creates a bottleneck for precise cross-modal understanding. To address these issues, we propose CL3DOR, Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds, designed to ensure greater specificity and clarity in both visual and textual content. Specifically, we increase the density of point clouds per object and construct informative hard negative responses in the training dataset to penalize unwanted responses. To leverage hard negative responses, we incorporate the odds ratio as an auxiliary term for contrastive learning into the conventional language modeling loss. CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks. Additionally, we demonstrate the effectiveness of CL3DOR's key components through extensive experiments.
https://arxiv.org/abs/2501.03879
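The odds-ratio auxiliary term can be sketched as follows, in the spirit of ORPO-style preference objectives: average per-token log-likelihoods of a preferred response and of an informative hard negative are turned into log-odds, their difference is pushed through a log-sigmoid, and the result is added to the usual language-modeling loss. The weighting coefficient and the dummy log-probabilities below are assumptions, not CL3DOR's exact formulation.

```python
# Odds-ratio auxiliary term added to a conventional language-modeling loss.
import torch
import torch.nn.functional as F

def odds_ratio_loss(logp_pos, logp_neg):
    """logp_pos/logp_neg: (B,) mean per-token log-probs of the preferred and the
    hard-negative response. odds(p) = p / (1 - p), computed in log space."""
    log_odds_pos = logp_pos - torch.log1p(-torch.exp(logp_pos))
    log_odds_neg = logp_neg - torch.log1p(-torch.exp(logp_neg))
    return -F.logsigmoid(log_odds_pos - log_odds_neg).mean()

def total_loss(lm_loss, logp_pos, logp_neg, beta=0.1):
    # Next-token loss plus the contrastive odds-ratio auxiliary term.
    return lm_loss + beta * odds_ratio_loss(logp_pos, logp_neg)

lm_loss = torch.tensor(2.3)               # placeholder LM loss from the 3D LMM
logp_pos = torch.tensor([-0.4, -0.6])     # preferred responses
logp_neg = torch.tensor([-1.5, -1.2])     # informative hard negatives
print(total_loss(lm_loss, logp_pos, logp_neg))
```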
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendations, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity to the item embedding. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial scenarios. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Moreover, it has been successfully deployed in a real-world e-commerce platform, processing millions of product listings daily while supporting dynamic, large-scale attribute taxonomies.
https://arxiv.org/abs/2501.03835
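At inference time, the retrieval formulation reduces to embedding the product profile and the candidate values, taking the best cosine match, and abstaining below a threshold so implicit or out-of-distribution cases map to "no value". The sketch below uses random embeddings and a fixed threshold; the deployed encoder and the adaptive, dynamic thresholding are not reproduced.

```python
# Retrieval-style attribute value prediction with an abstention threshold.
import torch
import torch.nn.functional as F

def retrieve_value(item_emb, value_embs, value_names, threshold=0.35):
    """item_emb: (d,) product-profile embedding; value_embs: (V, d) embeddings of
    the taxonomy's candidate values for one attribute."""
    sims = F.cosine_similarity(item_emb.unsqueeze(0), value_embs)   # (V,)
    score, idx = sims.max(dim=0)
    return value_names[int(idx)] if score >= threshold else None    # None = not applicable

d = 128
item = F.normalize(torch.randn(d), dim=0)              # e.g. encoded title + description
values = F.normalize(torch.randn(4, d), dim=-1)        # normalized candidate values
print(retrieve_value(item, values, ["red", "blue", "green", "black"]))
```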
Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid movement and the corresponding visual appearance variances are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving in sports, are usually divided into multiple sub-actions, each with a different duration. However, existing methods focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions, resulting in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression. Firstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network is introduced to separate different sub-actions and obtain segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics structural priors, to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the current low-quality human pose labels. In experiments, the results on FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at this https URL.
https://arxiv.org/abs/2501.03674
Contrastive learning has gained significant attention in short text clustering, yet it has an inherent drawback: it can mistakenly identify samples from the same category as negatives and then separate them in the feature space (false negative separation), which hinders the generation of superior representations. To generate more discriminative representations for efficient clustering, we propose a novel short text clustering method, called Discriminative Representation learning via Attention-Enhanced Contrastive Learning for Short Text Clustering (AECL). AECL consists of two modules: a pseudo-label generation module and a contrastive learning module. Both modules build a sample-level attention mechanism to capture similarity relationships between samples and aggregate cross-sample features to generate consistent representations. The former module then uses the more discriminative consistent representations to produce reliable supervision information that assists clustering, while the latter module exploits the similarity relationships and consistent representations to optimize the construction of positive samples and perform similarity-guided contrastive learning, effectively addressing the false negative separation issue. Experimental results demonstrate that the proposed AECL outperforms state-of-the-art methods. If the paper is accepted, we will open-source the code.
https://arxiv.org/abs/2501.03584
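A minimal sketch of the sample-level attention step: each short-text embedding attends over the other samples in the batch and is smoothed toward them, yielding the kind of consistent representation the two modules reuse. The single-head formulation and dimensions are simplifying assumptions, not the AECL architecture.

```python
# Sample-level attention over a batch of short-text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, z):                                  # z: (B, d) per-sample embeddings
        attn = torch.softmax(self.q(z) @ self.k(z).T / z.size(-1) ** 0.5, dim=-1)
        return F.normalize(attn @ self.v(z), dim=-1)       # cross-sample aggregation

z = torch.randn(32, 256)                   # e.g. sentence-encoder embeddings of short texts
consistent = SampleAttention(256)(z)       # (32, 256) smoothed, batch-consistent representations
print(consistent.shape)
```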
Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research has revealed that the modality gap generally exists in the representation of contrastive learning-based multi-modal systems, undermining the reliability of cross-modality metrics like CLIPScore. In this paper, we propose CAMScore, a cyclic reference-free automatic evaluation metric for image captioning models. To circumvent the aforementioned modality gap, CAMScore utilizes a text-to-image model to generate images from captions and subsequently evaluates these generated images against the original images. Furthermore, to provide fine-grained information for a more comprehensive evaluation, we design a three-level evaluation framework for CAMScore that encompasses pixel-level, semantic-level, and objective-level perspectives. Extensive experiment results across multiple benchmark datasets show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics, demonstrating the effectiveness of the framework.
https://arxiv.org/abs/2501.03567
Self-supervised learning (SSL) has significantly advanced image representation learning, yet efficiency challenges persist, particularly with adversarial training. Many SSL methods require extensive epochs to achieve convergence, a demand further amplified in adversarial settings. To address this inefficiency, we revisit the robust EMP-SSL framework, emphasizing the importance of increasing the number of crops per image to accelerate learning. Unlike traditional contrastive learning, robust EMP-SSL leverages multi-crop sampling, integrates an invariance term and regularization, and reduces training epochs, enhancing time efficiency. Evaluated with both standard linear classifiers and multi-patch embedding aggregation, robust EMP-SSL provides new insights into SSL evaluation strategies. Our results show that robust crop-based EMP-SSL not only accelerates convergence but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming multi-crop embedding aggregation. Additionally, we extend this approach with free adversarial training in Multi-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the effectiveness of free adversarial training in reducing training time while simultaneously improving clean accuracy and adversarial robustness. These findings underscore the potential of CF-AMC-SSL for practical SSL applications. Our code is publicly available at this https URL.
https://arxiv.org/abs/2501.03507
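The "free" adversarial ingredient can be sketched as a loop that replays the same multi-crop minibatch a few times, letting a single backward pass update both the encoder and the perturbation. The toy encoder, crop count, epsilon, and invariance loss below are assumptions, not the CF-AMC-SSL configuration.

```python
# Free adversarial training inside a multi-crop invariance objective (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
opt = torch.optim.SGD(encoder.parameters(), lr=0.05)
eps, replays = 4 / 255, 3

def invariance_loss(z):                        # z: (n_crops, B, d)
    z = F.normalize(z, dim=-1)
    mean = z.mean(dim=0, keepdim=True)
    return (1 - (z * mean).sum(-1)).mean()     # pull every crop toward the crop mean

crops = torch.rand(8, 16, 3, 32, 32)           # 8 crops x batch of 16 images
delta = torch.zeros_like(crops)

for _ in range(replays):                       # "free" replays on the same minibatch
    delta.requires_grad_(True)
    x = (crops + delta).clamp(0, 1).reshape(-1, 3, 32, 32)
    loss = invariance_loss(encoder(x).reshape(8, 16, -1))
    opt.zero_grad()
    loss.backward()                            # one backward serves model AND attack
    opt.step()
    delta = (delta + eps * delta.grad.sign()).clamp(-eps, eps).detach()
print(float(loss))
```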
Satisfactory progress has been achieved recently in the universal segmentation of CT images. Following the success of vision-language methods, there is a growing trend towards utilizing text prompts and contrastive learning to develop universal segmentation models. However, there exists a significant imbalance in information density between 3D images and text prompts. Moreover, the standard fully connected layer segmentation approach faces significant challenges in handling multiple classes and exhibits poor generalizability. To address these challenges, we propose the VOxel Interacting with LAnguage method (VOILA) for universal CT image segmentation. Initially, we align voxels and language into a shared representation space and classify voxels on the basis of cosine similarity. Subsequently, we develop the Voxel-Language Interaction framework to mitigate the impact of class imbalance caused by foreground-background discrepancies and variations in target volumes. Furthermore, a Complexity-Aware Sampling method is proposed to focus on regions that are hard to segment, achieved by generating pseudo-heatmaps from a trainable Gaussian mixture distribution. Our results indicate that the proposed VOILA achieves improved performance with reduced parameters and computational cost during training. Furthermore, it demonstrates significant generalizability across diverse datasets without additional fine-tuning.
https://arxiv.org/abs/2501.03482
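A minimal sketch of the voxel-language classification step: voxel features and class-prompt embeddings live in a shared space and each voxel is labeled by its highest cosine similarity, replacing a fully connected classification head. The encoder outputs and text embeddings below are random placeholders, not VOILA's trained representations.

```python
# Cosine-similarity voxel classification against text-prompt embeddings.
import torch
import torch.nn.functional as F

def voxel_language_logits(voxel_feats, text_embs, scale=10.0):
    """voxel_feats: (N, d) per-voxel features; text_embs: (C, d) one embedding per
    class prompt, e.g. from a frozen text encoder."""
    v = F.normalize(voxel_feats, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    return scale * v @ t.T                          # (N, C) cosine-similarity logits

voxels = torch.randn(10_000, 256)                   # voxels from a CT volume
classes = torch.randn(17, 256)                      # e.g. "liver", "spleen", ... prompts
pred = voxel_language_logits(voxels, classes).argmax(dim=-1)   # per-voxel class index
print(pred.shape)
```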
Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today's vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-trivial, and redundancy-minimized representation features. We then derive a joint-cross entropy loss function for self-supervised image representation learning, which theoretically enjoys superiority over the existing methods in reducing feature redundancy. Notably, our non-contrastive IMSVD method statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to our variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at this https URL.
https://arxiv.org/abs/2501.03469
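Soft variable discretization can be sketched as a softmax over bin centers for each latent variable, which makes per-variable batch distributions differentiable and therefore usable inside information measures. The bin count, temperature, and the simple entropy bonus below are assumptions for illustration, not the IMSVD objective.

```python
# Soft discretization of latent variables and a batch-level entropy measure.
import torch
import torch.nn.functional as F

def soft_discretize(z, centers, tau=0.1):
    """z: (B, D) latent variables; centers: (K,) shared bin centers.
    Returns (B, D, K) soft assignments of each variable to each bin."""
    dist = (z.unsqueeze(-1) - centers) ** 2          # squared distance to each bin
    return F.softmax(-dist / tau, dim=-1)

def batch_entropy(p):
    """p: (B, D, K) soft assignments -> entropy of each variable's batch marginal."""
    marginal = p.mean(dim=0)                                     # (D, K) empirical distribution
    return -(marginal * (marginal + 1e-8).log()).sum(dim=-1)     # (D,)

z = torch.randn(256, 64, requires_grad=True)         # a batch of 64-dim embeddings
centers = torch.linspace(-2, 2, steps=8)
p = soft_discretize(z, centers)
loss = -batch_entropy(p).mean()                       # e.g. encourage informative variables
loss.backward()
```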
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
https://arxiv.org/abs/2501.02786
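A hedged sketch of the audio-visual conditional normalisation layer: the difference-channel audio features are instance-normalised and then re-scaled and re-shifted with parameters predicted from the visual context, broadly in the FiLM/AdaIN style. Layer sizes and the single linear predictors are illustrative assumptions.

```python
# Audio-visual conditional normalisation layer (FiLM/AdaIN-style sketch).
import torch
import torch.nn as nn

class AudioVisualCondNorm(nn.Module):
    def __init__(self, audio_ch, visual_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(audio_ch, affine=False)
        self.to_scale = nn.Linear(visual_dim, audio_ch)
        self.to_shift = nn.Linear(visual_dim, audio_ch)

    def forward(self, audio_feat, visual_ctx):
        """audio_feat: (B, C, F, T) spectrogram features of the difference signal;
        visual_ctx: (B, Dv) pooled visual features of the current frame(s)."""
        gamma = self.to_scale(visual_ctx).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = self.to_shift(visual_ctx).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(audio_feat) + beta               # visually conditioned stats

layer = AudioVisualCondNorm(audio_ch=64, visual_dim=512)
out = layer(torch.randn(2, 64, 128, 64), torch.randn(2, 512))
print(out.shape)   # torch.Size([2, 64, 128, 64])
```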
With the growing significance of network security, the classification of encrypted traffic has emerged as an urgent challenge. Traditional byte-based traffic analysis methods are constrained by the rigid granularity of information and fail to fully exploit the diverse correlations between bytes. To address these limitations, this paper introduces MH-Net, a novel approach for classifying network traffic that leverages multi-view heterogeneous traffic graphs to model the intricate relationships between traffic bytes. The essence of MH-Net lies in aggregating varying numbers of traffic bits into multiple types of traffic units, thereby constructing multi-view traffic graphs with diverse information granularities. By accounting for different types of byte correlations, such as header-payload relationships, MH-Net further endows the traffic graph with heterogeneity, significantly enhancing model performance. Notably, we employ contrastive learning in a multi-task manner to strengthen the robustness of the learned traffic unit representations. Experiments conducted on the ISCX and CIC-IoT datasets for both the packet-level and flow-level traffic classification tasks demonstrate that MH-Net achieves the best overall performance compared to dozens of SOTA methods.
https://arxiv.org/abs/2501.03279
The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: this https URL
https://arxiv.org/abs/2501.02504
Contrastive learning, a prominent approach within self-supervised learning, has demonstrated significant effectiveness in developing generalizable models for various applications involving natural images. However, recent research indicates that these successes do not necessarily extend to the medical imaging domain. In this paper, we investigate the reasons for this suboptimal performance and hypothesize that the dense distribution of medical images poses challenges to the pretext tasks in contrastive learning, particularly in constructing positive and negative pairs. We explore model performance under different augmentation strategies and compare the results to those achieved with strong augmentations. Our study includes six publicly available datasets covering multiple clinically relevant tasks. We further assess the model's generalizability through external evaluations. The model pre-trained with weak augmentation outperforms those with strong augmentation, improving AUROC from 0.838 to 0.848 and AUPR from 0.523 to 0.597 on MESSIDOR2, and showing similar enhancements across other datasets. Our findings suggest that optimizing the scale of augmentation is critical for enhancing the efficacy of contrastive learning in medical imaging.
https://arxiv.org/abs/2501.02451
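For concreteness, here is what "weak" versus "strong" augmentation pipelines of the kind compared in this study might look like; the exact transforms and magnitudes used in the paper are not reproduced here, these are common SimCLR-style choices.

```python
# Illustrative weak vs. strong augmentation pipelines for contrastive pre-training.
from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # mild cropping
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # aggressive cropping
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# Each pipeline would produce the two views fed to the contrastive objective, e.g.:
# view1, view2 = weak_aug(img), weak_aug(img)
```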