Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits the problem and reveals two critical yet overlooked issues: 1) the cross-domain gap, i.e., significant variation between source- and target-domain pose distributions; and 2) the structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolutional Network (GCN) and Transformer layers to constrain the topology of the generated skeleton by exploring adjacent-overarching relationships among human joints. Extensive experiments on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D and 3D human pose estimation.
https://arxiv.org/abs/2501.09411
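As a rough illustration of the first phase described above, here is a minimal PyTorch sketch of a temporal-consistent contrastive loss with uniformity regularization. Treating adjacent temporal windows of the same clip as positives, and the weighting `lam`, are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_t, z_next, temperature=0.1, lam=0.5):
    """z_t, z_next: (B, D) embeddings of each WiFi clip at two adjacent
    temporal windows (assumed positives); other samples in the batch
    serve as negatives."""
    z_t, z_next = F.normalize(z_t, dim=-1), F.normalize(z_next, dim=-1)
    logits = z_t @ z_next.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(z_t.size(0), device=z_t.device)
    info_nce = F.cross_entropy(logits, labels)
    # Uniformity regularizer (Wang & Isola, 2020): spreads embeddings over
    # the hypersphere, discouraging domain-specific collapse.
    uniformity = torch.pdist(z_t, p=2).pow(2).mul(-2.0).exp().mean().log()
    return info_nce + lam * uniformity
```

The uniformity term is what pushes representations from different domains toward a shared, well-spread embedding distribution.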
Few-shot class-incremental learning (FSCIL) requires a model to learn new classes from a small number of training instances while retaining knowledge of previously learned classes. Existing frameworks typically freeze the parameters associated with previously learned classes when incorporating new ones. However, this approach often results in suboptimal separation of the previously learned classes, leading to overlap between old and new classes and, consequently, degraded performance on old classes as new ones are added. To address these challenges, we propose a novel feature-augmentation-driven contrastive learning framework designed to enhance the separation of previously learned classes so as to accommodate new classes. Our approach augments feature vectors and assigns proxy labels to the augmented vectors. This strategy expands the feature space, ensuring seamless integration of new classes within the expanded space. Additionally, we employ a self-supervised contrastive loss to improve the separation between previous classes. We validate our framework through experiments on three FSCIL benchmark datasets: CIFAR100, miniImageNet, and CUB200. The results demonstrate that our Feature Augmentation driven Contrastive Learning framework significantly outperforms other approaches, achieving state-of-the-art performance.
https://arxiv.org/abs/2501.09361
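The feature-augmentation idea lends itself to a short sketch. Gaussian perturbation as the augmentation and the proxy-label assignment (shifting labels by the class count) are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def augment_with_proxies(feats, labels, num_classes, sigma=0.3):
    """feats: (B, D) backbone features; labels: (B,) in [0, num_classes).
    Perturbed copies receive proxy labels in [C, 2C), expanding the label
    space and reserving embedding room for future classes."""
    aug = feats + sigma * torch.randn_like(feats)
    return torch.cat([feats, aug]), torch.cat([labels, labels + num_classes])

def sup_contrastive_loss(feats, labels, temperature=0.2):
    """Supervised contrastive loss: samples sharing a (proxy) label attract,
    all others repel, tightening class separation."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.t() / temperature
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float('-inf'))      # drop self-similarity
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    return -(log_prob.masked_fill(~pos, 0.0).sum(1)
             / pos.sum(1).clamp(min=1)).mean()
```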
Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
https://arxiv.org/abs/2501.09321
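The kernel-space feature matching admits a compact sketch; the RBF form and the `gamma` bandwidth below are assumptions standing in for the paper's Gaussian kernel function.

```python
import torch

def rbf_kernel(x, y, gamma=0.5):
    """x, y: (B, N, D) flattened feature tokens; returns (B, N) similarities."""
    return torch.exp(-gamma * (x - y).pow(2).sum(-1))

def kernel_distill_loss(f_student, f_teacher, gamma=0.5):
    """Since k(x, x) = 1 for an RBF kernel, the squared distance in kernel
    space reduces to 2 - 2*k(s, t): a bounded, smooth pull of student
    features toward the teacher's."""
    return (2.0 - 2.0 * rbf_kernel(f_student, f_teacher, gamma)).mean()
```

The boundedness is the practical appeal: unlike a raw L2 feature loss, the gradient does not blow up early in training when student and teacher features are far apart.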
Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.
https://arxiv.org/abs/2501.09294
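A hedged sketch of what two-level contrastive alignment can look like is below; mean-pooling regions and tokens for the local level is a simplification, and the shapes are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE between matched image/text embeddings, (B, D)."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def hierarchical_contrastive_loss(img_global, txt_global,
                                  img_regions, txt_tokens, alpha=0.5):
    """img_global/txt_global: (B, D); img_regions: (B, R, D) region features;
    txt_tokens: (B, T, D) token features. Global and (pooled) local levels
    are aligned with the same CLIP-style objective."""
    global_loss = clip_style_loss(img_global, txt_global)
    local_loss = clip_style_loss(img_regions.mean(1), txt_tokens.mean(1))
    return global_loss + alpha * local_loss
```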
Short text classification has gained significant attention in the information age due to its prevalence and real-world applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
https://arxiv.org/abs/2501.09219
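The augmentation-free multi-view contrast can be sketched directly: each component graph yields its own embedding of the same text, so row i across views forms a natural positive pair. The pairwise InfoNCE over all view pairs is an assumption about how the views are combined.

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def multi_view_contrastive(views, temperature=0.1):
    """views: list of (B, D) text embeddings, one per component graph
    (e.g., word, POS, and entity graphs). No augmentation is needed:
    cross-view pairs (i, i) are positives by construction."""
    views = [F.normalize(v, dim=-1) for v in views]
    labels = torch.arange(views[0].size(0), device=views[0].device)
    pairs = list(combinations(views, 2))
    loss = sum(F.cross_entropy(za @ zb.t() / temperature, labels)
               for za, zb in pairs)
    return loss / len(pairs)
```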
Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and the scarcity of labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical, linguistic, and factual information) exploration to alleviate the sparsity issue. Then, a graph learning approach is adopted to learn the representation of short texts, which are represented in graph form. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture contrastive information of different granularities within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationships among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.
https://arxiv.org/abs/2501.09214
Medical images and reports offer invaluable insights into patient health, but the heterogeneity and complexity of these data hinder effective analysis. To bridge this gap, we investigate contrastive learning models for cross-domain retrieval, which associates medical images with their corresponding clinical reports. This study benchmarks the robustness of four state-of-the-art contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP. We introduce an occlusion retrieval task to evaluate model performance under varying levels of image corruption. Our findings reveal that all evaluated models are highly sensitive to out-of-distribution data, as evidenced by the proportional decrease in performance with increasing occlusion levels. While MedCLIP exhibits slightly more robustness, its overall performance remains significantly behind CXR-CLIP and CXR-RePaiR. CLIP, trained on a general-purpose dataset, struggles with medical image-report retrieval, highlighting the importance of domain-specific training data. Our evaluation suggests that more effort is needed to improve the robustness of these models; by addressing these limitations, we can develop more reliable cross-domain retrieval models for medical applications.
https://arxiv.org/abs/2501.09134
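The occlusion-retrieval protocol might look roughly like the sketch below; the patch-wise random masking, patch size, and recall@k as the retrieval metric are assumptions about the setup.

```python
import torch

def occlude_patches(images, ratio, patch=16):
    """images: (B, C, H, W) with H, W divisible by `patch`. Zeroes out a
    random `ratio` of patch cells to simulate image corruption."""
    b, _, h, w = images.shape
    keep = torch.rand(b, 1, h // patch, w // patch,
                      device=images.device) >= ratio
    mask = keep.float().repeat_interleave(patch, dim=2) \
                       .repeat_interleave(patch, dim=3)
    return images * mask

@torch.no_grad()
def recall_at_k(img_emb, txt_emb, k=5):
    """Row i of img_emb matches row i of txt_emb (L2-normalized embeddings);
    sweeping `ratio` in occlude_patches and re-embedding traces the
    robustness curve."""
    sim = img_emb @ txt_emb.t()
    topk = sim.topk(k, dim=1).indices
    truth = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    return (topk == truth).any(dim=1).float().mean().item()
```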
In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. To this end, we build on top of the CONFORM framework, which uses contrastive learning to improve the accuracy of images generated with multiple objects. However, the depiction of actions involving multiple different objects still has large room for improvement. To improve it, we employ semantically hypergraphic contrastive adjacency learning, which combines an enhanced contrastive structure with a "contrast but link" technique. We further amend Stable Diffusion's understanding of actions with InteractDiffusion. As evaluation metrics, we use CLIP image-text similarity and TIFA; in addition, we conducted a user study. Our method shows promising results even for verbs that Stable Diffusion understands only mediocrely. We then provide future directions by analyzing the results. Our codebase can be found on polybox under the link: this https URL
https://arxiv.org/abs/2501.09055
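For reference, a minimal sketch of a CLIP image-text similarity score via Hugging Face transformers; the checkpoint choice is an assumption, not necessarily the one the project used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the embeddings of a generated image and
    its text prompt; higher means more faithful generation."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.t()).item()
```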
Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely, whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings and exhibited reasonable salient regions associated with its embeddings. This study demonstrates the value of large-scale medical imaging foundation models and, by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.
https://arxiv.org/abs/2501.09001
Background: Adolescents are particularly vulnerable to mental disorders, with over 75% of cases manifesting before the age of 25. Research indicates that only 18 to 34% of young people experiencing high levels of depression or anxiety symptoms seek support. Digital tools leveraging smartphones offer scalable and early intervention opportunities. Objective: Using a novel machine learning framework, this study evaluated the feasibility of integrating active and passive smartphone data to predict mental disorders in non-clinical adolescents. Specifically, we investigated the utility of the Mindcraft app in predicting risks for internalising and externalising disorders, eating disorders, insomnia and suicidal ideation. Methods: Participants (N=103; mean age 16.1 years) were recruited from three London schools. Participants completed the Strengths and Difficulties Questionnaire, the Eating Disorders-15 Questionnaire, Sleep Condition Indicator Questionnaire and indicated the presence/absence of suicidal ideation. They used the Mindcraft app for 14 days, contributing active data via self-reports and passive data from smartphone sensors. A contrastive pretraining phase was applied to enhance user-specific feature stability, followed by supervised fine-tuning. The model evaluation employed leave-one-subject-out cross-validation using balanced accuracy as the primary metric. Results: The integration of active and passive data achieved superior performance compared to individual data sources, with mean balanced accuracies of 0.71 for SDQ-High risk, 0.67 for insomnia, 0.77 for suicidal ideation and 0.70 for eating disorders. The contrastive learning framework stabilised daily behavioural representations, enhancing predictive robustness. This study demonstrates the potential of integrating active and passive smartphone data with advanced machine-learning techniques for predicting mental health risks.
https://arxiv.org/abs/2501.08851
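The evaluation protocol can be sketched with scikit-learn; the logistic-regression classifier is a placeholder for the study's fine-tuned network, and the feature matrix layout is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_balanced_accuracy(X, y, subject_ids):
    """X: (n_samples, n_features) daily active+passive features;
    y: binary risk labels; subject_ids groups every sample of one
    participant so their data never straddles the train/test split."""
    scores = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        scores.append(balanced_accuracy_score(y[test], clf.predict(X[test])))
    return float(np.mean(scores))
```

Balanced accuracy is a sensible headline metric here because the risk labels are imbalanced; plain accuracy would reward always predicting the majority class.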
This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by patch embedding, which inevitably introduces noise into patches and may compromise the performance of the re-identification model. On the other hand, previous memory-bank-based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo-label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, limiting model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at this https URL.
https://arxiv.org/abs/2501.09044
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotation at the pixel level, which can be costly to acquire in the real world. Unsupervised domain adaptation (UDA) for semantic segmentation trains a model on labeled virtual data and adapts it to unlabeled real data. Some recent works use contrastive learning, a powerful self-supervised learning method, to support this adaptation. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at this https URL
https://arxiv.org/abs/2501.09040
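A hedged sketch of pseudo-label-guided pixel contrast is below; sampling pixel embeddings, keeping per-class prototypes (e.g., as running means), and the ignore index are assumptions about the implementation.

```python
import torch
import torch.nn.functional as F

def pixel_contrast_loss(pix_feats, pseudo_labels, prototypes,
                        temperature=0.1, ignore_index=255):
    """pix_feats: (N, D) sampled target-image pixel embeddings;
    pseudo_labels: (N,) classes assigned by the current model;
    prototypes: (C, D) per-class feature centers. Each pixel is pulled
    toward its pseudo-labeled prototype and pushed from the others."""
    valid = pseudo_labels != ignore_index
    z = F.normalize(pix_feats[valid], dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.t() / temperature               # (N_valid, C)
    return F.cross_entropy(logits, pseudo_labels[valid])
```

Contrasting pixels against class prototypes rather than against every other pixel is one way to tolerate intra-class feature diversity: a pixel only has to agree with its class center.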
Remote sensing imagery is dense with objects and contextual visual information. There is a recent trend to combine paired satellite images and text captions for pretraining performant encoders for downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification ability, vision-only downstream performance tends to degrade compared to image-only pretraining, such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a baseline of SkyCLIP for vision-only tasks such as KNN classification and semantic segmentation, +6% mIoU on SpaceNet1, while retaining the ability to perform zero-shot classification, unlike MAE pretrained methods.
https://arxiv.org/abs/2501.08490
Learning from tabular data is of paramount importance, as it complements the conventional analysis of image and video data by providing a rich source of structured information that is often critical for comprehensive understanding and decision-making processes. We present Multi-task Contrastive Masked Tabular Modeling (MT-CMTM), a novel method aiming to enhance tabular models by leveraging the correlation between tabular data and corresponding images. MT-CMTM employs a dual strategy combining contrastive learning with masked tabular modeling, optimizing the synergy between these data modalities. Central to our approach is a 1D Convolutional Neural Network with residual connections and an attention mechanism (1D-ResNet-CBAM), designed to efficiently process tabular data without relying on images. This enables MT-CMTM to handle purely tabular data for downstream tasks, eliminating the need for potentially costly image acquisition and processing. We evaluated MT-CMTM on the DVM car dataset, which is uniquely suited for this scenario, and the newly developed HIPMP dataset, which connects membrane fabrication parameters with image data. Our MT-CMTM model outperforms the proposed tabular 1D-ResNet-CBAM trained from scratch, achieving a 1.48% relative improvement in MSE on HIPMP and a 2.38% increase in absolute accuracy on DVM. These results demonstrate MT-CMTM's robustness and its potential to advance the field of multi-modal learning.
https://arxiv.org/abs/2501.07304
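A sketch of a residual 1D convolutional block with CBAM-style channel and spatial attention, assuming tabular features are reshaped to (batch, channels, length); all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class CBAM1d(nn.Module):
    def __init__(self, channels, reduction=8, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                          # x: (B, C, L)
        # Channel attention from average- and max-pooled descriptors.
        ca = torch.sigmoid(self.mlp(x.mean(-1)) + self.mlp(x.amax(-1)))
        x = x * ca.unsqueeze(-1)
        # Spatial attention over channel-wise statistics.
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(stats))

class ResBlock1d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.BatchNorm1d(channels), CBAM1d(channels))

    def forward(self, x):
        return torch.relu(x + self.body(x))
```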
3D single object tracking (3DSOT) in LiDAR point clouds is a critical task for outdoor perception, enabling real-time perception of object location, orientation, and motion. Despite the impressive performance of current 3DSOT methods, evaluating them on clean datasets inadequately reflects their comprehensive performance, as the adverse weather conditions encountered in real-world surroundings have not been considered. One of the main obstacles is the lack of adverse-weather benchmarks for evaluating 3DSOT. To this end, this work proposes a challenging benchmark for LiDAR-based 3DSOT in adverse weather, comprising two synthetic datasets (KITTI-A and nuScenes-A) and one real-world dataset (CADC-SOT) spanning three weather types: rain, fog, and snow. Based on this benchmark, we evaluate the robustness of five representative 3D trackers from different tracking frameworks and observe significant performance degradation. This prompts the question: what causes current advanced methods to fail on such adverse-weather samples? Consequently, we explore the impacts of adverse weather and answer this question from three perspectives: 1) target distance; 2) template shape corruption; and 3) target shape corruption. Finally, based on domain randomization and contrastive learning, we design a dual-branch tracking framework for adverse weather, named DRCT, which achieves excellent performance on the benchmark.
https://arxiv.org/abs/2501.07133
In deep regression, capturing the relationship among continuous labels in feature space is a fundamental challenge that has attracted increasing interest. Addressing this issue can prevent models from converging to suboptimal solutions across various regression tasks, leading to improved performance, especially for imbalanced regression and under limited sample sizes. However, existing approaches often rely on order-aware representation learning or distance-based weighting. In this paper, we hypothesize a linear negative correlation between label distances and representation similarities in regression tasks. To implement this, we propose an angle-compensated contrastive regularizer for deep regression, which adjusts the cosine distance between anchor and negative samples within the contrastive learning framework. Our method offers a plug-and-play solution compatible with most existing contrastive learning methods for regression tasks. Extensive experiments and theoretical analysis demonstrate that our angle-compensated contrastive regularizer not only achieves competitive regression performance but also excels in data efficiency and effectiveness on imbalanced datasets.
https://arxiv.org/abs/2501.07045
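The hypothesized linear negative correlation translates almost directly into a regularizer; normalizing label distances by the in-batch maximum is an assumption about scaling.

```python
import torch
import torch.nn.functional as F

def angle_compensated_regularizer(feats, labels):
    """feats: (B, D) embeddings; labels: (B,) continuous regression targets.
    Encodes the paper's hypothesis that cosine similarity between two
    samples should fall linearly as their label distance grows."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.t()                                 # pairwise cosine sims
    d = (labels.unsqueeze(0) - labels.unsqueeze(1)).abs()
    target = 1.0 - d / d.max().clamp(min=1e-8)      # linear decay to 0
    off_diag = ~torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    return F.mse_loss(sim[off_diag], target[off_diag])
```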
In business analysis, providing effective recommendations is essential for enhancing company profits. Graph-based structures, such as bipartite graphs, have gained popularity for their ability to model complex data relationships. Link prediction is crucial for recommending specific items to users. Traditional methods in this area often involve identifying patterns in the graph structure or using representational techniques like graph neural networks (GNNs). However, these approaches encounter difficulties as the volume of data increases. To address these challenges, we propose a model called Graph Contrastive Learning for Multi-label Classification (MCGCL), which leverages contrastive learning to enhance recommendation effectiveness. The model incorporates two training stages: a main task and a subtask. The main task is holistic user-item graph learning to capture user-item relationships; in the subtask, homogeneous user-user and item-item subgraphs are constructed to capture user-user and item-item relationships. We assessed performance on multi-label classification tasks using real-world datasets from Amazon Reviews. Comparative experiments with state-of-the-art methods confirm the effectiveness of MCGCL, highlighting its potential for improving recommendation systems.
https://arxiv.org/abs/2501.06985
The automatic identification of Magnetic Resonance Imaging (MRI) sequences can streamline clinical workflows by reducing the time radiologists spend manually sorting and identifying sequences, thereby enabling faster diagnosis and treatment planning for patients. However, the lack of standardization in the parameters of MRI scans poses challenges for automated systems and complicates the generation and utilization of datasets for machine learning research. To address this issue, we propose a system for MRI sequence identification using an unsupervised contrastive deep learning framework. By training a convolutional neural network based on the ResNet-18 architecture, our system classifies nine common MRI sequence types as a 9-class classification problem. The network was trained using an in-house internal dataset and validated on several public datasets, including BraTS, ADNI, Fused Radiology-Pathology Prostate Dataset, the Breast Cancer Dataset (ACRIN), among others, encompassing diverse acquisition protocols and requiring only 2D slices for training. Our system achieves a classification accuracy of over 0.95 across the nine most common MRI sequence types.
https://arxiv.org/abs/2501.06938
Depicting novel classes with language descriptions after observing only a few samples is inherent to human learning. This lifelong learning capability helps distinguish new knowledge from old as the open world is learned incrementally, a setting known as Few-Shot Class-Incremental Learning (FSCIL). Existing works on this problem mainly rely on careful tuning of visual encoders, which shows an evident trade-off between base and incremental knowledge. Motivated by human learning systems, we propose a new Language-inspired Relation Transfer (LRT) paradigm that understands objects through joint visual clues and text depictions, composed of two major steps. First, we transfer pretrained text knowledge to the visual domain with a graph relation transformation module and then fuse the visual and language embeddings with a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual finetuning, we propose context prompt learning for fast domain alignment and imagined contrastive learning to alleviate the shortage of text data during alignment. With collaborative learning of domain alignment and text-image transfer, our proposed LRT outperforms state-of-the-art models by over 13% and 7% on the final session of the mini-ImageNet and CIFAR-100 FSCIL benchmarks.
https://arxiv.org/abs/2501.05862
Prototype-based federated learning has emerged as a promising approach that shares lightweight prototypes to transfer knowledge among clients with data heterogeneity in a model-agnostic manner. However, existing methods often collect prototypes directly from local models, a practice that inevitably introduces inconsistencies into representation learning due to biased data distributions and differing model architectures among clients. In this paper, we identify that both statistical and model heterogeneity create a vicious cycle of representation inconsistency, classifier divergence, and skewed prototype alignment, which negatively impacts client performance. To break this vicious cycle, we propose a novel framework named Federated Learning via Semantic Anchors (FedSA) that decouples the generation of prototypes from local representation learning. We introduce a novel perspective that uses simple yet effective semantic anchors as prototypes to guide local models in learning consistent representations. By incorporating semantic anchors, we further propose anchor-based regularization with margin-enhanced contrastive learning and anchor-based classifier calibration to correct feature extractors and calibrate classifiers across clients, achieving intra-class compactness and inter-class separability of prototypes while ensuring consistent decision boundaries. We then update the semantic anchors with these consistent and discriminative prototypes, which iteratively encourage clients to collaboratively learn a unified data representation with robust generalization. Extensive experiments under both statistical and model heterogeneity settings show that FedSA significantly outperforms existing prototype-based FL methods on various classification tasks.
https://arxiv.org/abs/2501.05496
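A simplified sketch of anchor-based regularization with a margin; the hinge form below is an illustrative stand-in for FedSA's margin-enhanced contrastive objective, and the anchor tensor is assumed to be broadcast from the server.

```python
import torch
import torch.nn.functional as F

def anchor_regularizer(feats, labels, anchors, margin=0.2):
    """feats: (B, D) local features; labels: (B,) class ids;
    anchors: (C, D) server-shared semantic anchors. Each feature must be
    at least `margin` more similar to its own class anchor than to the
    hardest other anchor, yielding compact, well-separated classes."""
    z = F.normalize(feats, dim=-1)
    a = F.normalize(anchors, dim=-1)
    sim = z @ a.t()                                 # (B, C)
    pos = sim.gather(1, labels.unsqueeze(1))        # own-anchor similarity
    neg = sim.masked_fill(F.one_hot(labels, a.size(0)).bool(),
                          float('-inf')).amax(dim=1, keepdim=True)
    return F.relu(margin + neg - pos).mean()
```

Because the anchors are fixed targets shared by every client, local feature extractors align to a common geometry regardless of each client's data bias or architecture.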