Automated medical diagnosis through image-based neural networks has grown in popularity and matured over the years. Nevertheless, it is constrained by the scarcity of medical images and expensive annotation labor. Self-Supervised Learning (SSL) is a good alternative to Transfer Learning (TL) and is well suited to imbalanced image datasets. In this study, we assess four pretrained SSL models and two TL models on the classification of treatable retinal diseases, using small-scale Optical Coherence Tomography (OCT) training sets of 125 to 4,000 images with balanced or imbalanced distributions. The proposed SSL model achieves state-of-the-art accuracy of 98.84% using only 4,000 training images. Our results suggest that SSL models provide superior performance under both balanced and imbalanced training scenarios. The SSL model with the MoCo-v2 scheme performs consistently well under the imbalanced scenario and, notably, surpasses the other models when the training set contains fewer than 500 images.
https://arxiv.org/abs/2404.10166
Automated segmentation is a fundamental medical image analysis task that has enjoyed significant advances due to the advent of deep learning. While foundation models have been useful in natural language processing and some vision tasks for some time, a foundation model built with image segmentation in mind, the Segment Anything Model (SAM), appeared only recently and has shown similar promise. However, there are still no systematic analyses or "best-practice" guidelines for optimally fine-tuning SAM for medical image segmentation. This work summarizes existing fine-tuning strategies across various backbone architectures, model components, and fine-tuning algorithms, covering 18 combinations, and evaluates them on 17 datasets spanning all common radiology modalities. Our study reveals that (1) fine-tuning SAM leads to slightly better performance than previous segmentation methods, (2) fine-tuning strategies that use parameter-efficient learning in both the encoder and decoder are superior to other strategies, (3) network architecture has a small impact on final performance, and (4) further training SAM with self-supervised learning can improve final model performance. We also demonstrate the ineffectiveness of some methods popular in the literature and further expand our experiments into few-shot and prompt-based settings. Lastly, we released our code and MRI-specific fine-tuned weights, which consistently obtained superior performance over the original SAM, at this https URL.
https://arxiv.org/abs/2404.09957
We explore on-device self-supervised collaborative fine-tuning of large language models with limited local data availability. Taking inspiration from the collaborative learning community, we introduce three distinct trust-weighted gradient aggregation schemes: weight similarity-based, prediction similarity-based, and validation performance-based. To minimize communication overhead, we integrate Low-Rank Adaptation (LoRA) and exchange only LoRA weight updates. Our protocols driven by prediction and performance metrics surpass both FedAvg and local fine-tuning, an advantage that is particularly evident in realistic scenarios with more diverse local data distributions. The results underscore the effectiveness of our approach in addressing heterogeneity and scarcity within local datasets.
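As a concrete illustration, here is a minimal, hypothetical sketch of the prediction-similarity-based scheme over exchanged LoRA updates; the probe-batch comparison, the softmax weighting, and all function names are our assumptions rather than the paper's exact protocol.

```python
# Hypothetical sketch: trust-weighted aggregation of LoRA updates,
# where trust comes from prediction similarity on a shared probe batch.
import torch
import torch.nn.functional as F

def trust_weights_from_predictions(own_logits, peer_logits_list, temperature=1.0):
    # Weight each peer by the cosine similarity between its predictions
    # and ours; a softmax turns similarities into trust weights.
    sims = torch.stack([
        F.cosine_similarity(own_logits.flatten(), peer.flatten(), dim=0)
        for peer in peer_logits_list
    ])
    return torch.softmax(sims / temperature, dim=0)

def aggregate_lora_updates(own_update, peer_updates, weights, self_weight=0.5):
    # Each update maps LoRA parameter names to delta tensors; we blend our
    # own update with a trust-weighted average of the peers' updates.
    merged = {}
    for name, delta in own_update.items():
        peer_avg = sum(w * u[name] for w, u in zip(weights, peer_updates))
        merged[name] = self_weight * delta + (1.0 - self_weight) * peer_avg
    return merged
```

Because only the low-rank deltas are exchanged, the communication cost scales with the LoRA rank rather than the full model size.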
https://arxiv.org/abs/2404.09753
Self-supervised learning (SSL) has emerged as a promising solution for addressing the challenge of limited labeled data in deep neural networks (DNNs), offering scalability potential. However, the impact of design dependencies within the SSL framework remains insufficiently investigated. In this study, we comprehensively explore SSL behavior across a spectrum of augmentations, revealing their crucial role in shaping SSL model performance and learning mechanisms. Leveraging these insights, we propose a novel learning approach that integrates prior knowledge, with the aim of curtailing the need for extensive data augmentations and thereby amplifying the efficacy of learned representations. Notably, our findings underscore that SSL models imbued with prior knowledge exhibit reduced texture bias, diminished reliance on shortcuts and augmentations, and improved robustness against both natural and adversarial corruptions. These findings not only illuminate a new direction in SSL research, but also pave the way for enhancing DNN performance while concurrently alleviating the imperative for intensive data augmentation, thereby enhancing scalability and real-world problem-solving capabilities.
https://arxiv.org/abs/2404.09752
The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, conditioned on keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorizes their appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and to support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state of the art in terms of motion transfer quality and temporal consistency.
https://arxiv.org/abs/2404.09736
Robots that assist in daily life must locate specific instances of objects in the environment that match the user's desired object. This task is known as Instance-Specific Image Goal Navigation (InstanceImageNav) and requires a model capable of distinguishing between different instances within the same class. One significant challenge in robotics is that when a robot observes the same object from various 3D viewpoints, its appearance may differ greatly, making it difficult to recognize and locate the object accurately. In this study, we introduce SimView, a method that leverages multi-view images based on a 3D semantic map of the environment and self-supervised learning with SimSiam to train an instance identification model on-site. The effectiveness of our approach is validated using a photorealistic simulator, Habitat Matterport 3D, created by scanning real home environments. Our results demonstrate a 1.7-fold improvement in task accuracy compared to CLIP, a model pre-trained with multimodal contrastive learning, used here for object search. This improvement highlights the benefit of our proposed fine-tuning method in enhancing the performance of assistive robots on InstanceImageNav tasks. The project website is this https URL.
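For context, the SimSiam objective such a method fine-tunes with can be sketched in a few lines; the backbone, projector, and predictor dimensions below are illustrative assumptions, with the two views standing in for observations of the same instance from different 3D viewpoints.

```python
# A minimal SimSiam-style objective: two views of the same object
# instance (e.g., crops from different 3D viewpoints) should map to
# similar embeddings. Feature and projection sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimSiamHead(nn.Module):
    def __init__(self, backbone, feat_dim=512, proj_dim=128):
        super().__init__()
        self.backbone = backbone  # any image encoder -> (B, feat_dim)
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim))
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim))

    def forward(self, view_a, view_b):
        z_a = self.projector(self.backbone(view_a))
        z_b = self.projector(self.backbone(view_b))
        p_a, p_b = self.predictor(z_a), self.predictor(z_b)
        # The stop-gradient on the target branch is what prevents
        # representational collapse without negative pairs.
        return -(F.cosine_similarity(p_a, z_b.detach()).mean()
                 + F.cosine_similarity(p_b, z_a.detach()).mean()) / 2
```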
https://arxiv.org/abs/2404.09647
Automatic 3D facial texture generation has gained significant interest recently. Existing approaches either do not support the traditional physically based rendering (PBR) pipeline or rely on 3D data captured by a Light Stage. Our key contribution is a progressive latent space refinement approach that can bootstrap from texture maps generated from facial images by 3D Morphable Models (3DMMs) to produce high-quality and diverse PBR textures, including albedo, normal, and roughness maps. It starts by enhancing Generative Adversarial Networks (GANs) for text-guided and diverse texture generation. To this end, we design a self-supervised paradigm that overcomes the reliance on ground-truth 3D textures and trains the generative model with only entangled texture maps. Besides, we foster mutual enhancement between GANs and Score Distillation Sampling (SDS): SDS boosts GANs with more generative modes, while GANs promote more efficient optimization of SDS. Furthermore, we introduce an edge-aware SDS for multi-view consistent facial structure. Experiments demonstrate that our method outperforms existing 3D texture generation methods in photo-realistic quality, diversity, and efficiency.
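For readers unfamiliar with SDS, a schematic update step looks roughly as follows; `add_noise` and `denoiser` are assumed stand-ins for a pretrained diffusion model's forward process and noise predictor, and the usual timestep weighting w(t) is omitted for brevity.

```python
# Schematic SDS step under assumed interfaces; this is a generic
# sketch of score distillation, not the paper's edge-aware variant.
import torch

def sds_surrogate_loss(generator_out, denoiser, add_noise, t, text_emb):
    # Noise the generated image, ask the frozen diffusion prior to
    # predict that noise, and use the residual as a gradient signal.
    noise = torch.randn_like(generator_out)
    noisy = add_noise(generator_out, noise, t)
    with torch.no_grad():
        eps_pred = denoiser(noisy, t, text_emb)
    grad = eps_pred - noise
    # Surrogate loss whose gradient w.r.t. generator_out equals 'grad',
    # so backprop pushes the generator toward the diffusion prior.
    return (grad.detach() * generator_out).sum()
```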
https://arxiv.org/abs/2404.09540
We introduce a novel approach to single-image denoising based on the blind-spot denoising principle, which we call MAsked and SHuffled Blind Spot Denoising (MASH). We focus on the case of correlated noise, which often plagues real images. MASH is the result of a careful analysis of the relationship between the level of blindness (masking) of the input and the (unknown) noise correlation. Moreover, we introduce a shuffling technique that weakens the local correlation of noise, which in turn yields a further denoising performance improvement. We evaluate MASH via extensive experiments on real-world noisy image datasets. We demonstrate results on par with or better than existing self-supervised denoising methods.
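The two ingredients named above, masking and local shuffling, can be sketched as simple input transforms; the mask ratio and shuffle window below are assumed hyperparameters, not values from the paper.

```python
# Illustrative sketch: (1) masking input pixels (blind spots) and
# (2) locally shuffling pixels to weaken spatially correlated noise.
import torch

def mask_pixels(img, mask_ratio=0.3):
    # img: (C, H, W). Zero out a random subset of pixels; the network
    # must predict them from context instead of copying the noisy input.
    mask = (torch.rand_like(img[:1]) > mask_ratio).float()
    return img * mask, mask

def local_shuffle(img, window=2):
    # Randomly permute pixels inside non-overlapping windows, breaking
    # short-range noise correlation while roughly preserving content.
    c, h, w = img.shape
    out = img.clone()
    for y in range(0, h - window + 1, window):
        for x in range(0, w - window + 1, window):
            patch = out[:, y:y + window, x:x + window].reshape(c, -1)
            perm = torch.randperm(patch.shape[1])
            out[:, y:y + window, x:x + window] = patch[:, perm].reshape(c, window, window)
    return out
```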
https://arxiv.org/abs/2404.09389
Amid the ever-evolving development of vision-language models, contrastive language-image pretraining (CLIP) has set new benchmarks in many downstream tasks, such as zero-shot classification, by leveraging self-supervised contrastive learning on large amounts of text-image pairs. However, its dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the enhanced capability of RankCLIP to effectively improve performance across various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the potential of RankCLIP in further advancing vision-language pretraining.
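One plausible way to express ranking consistency in code is as a listwise KL divergence between the batch "rankings" induced by in-modal and cross-modal similarities; this is our interpretation for illustration, and the exact RankCLIP objective may differ.

```python
# Our illustrative reading of ranking consistency: row-wise softmax
# "rankings" induced by in-modal and cross-modal similarities should
# agree, measured by a listwise KL. In practice the diagonal
# self-similarities would also typically be masked.
import torch
import torch.nn.functional as F

def ranking_kl(sim_target, sim_pred, tau=0.07):
    # KL from the distribution induced by sim_target to that of sim_pred.
    log_pred = F.log_softmax(sim_pred / tau, dim=-1)
    target = F.softmax(sim_target / tau, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")

def rank_consistency_loss(img_emb, txt_emb, tau=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    s_ii = img_emb @ img_emb.t()   # in-modal: image-image
    s_tt = txt_emb @ txt_emb.t()   # in-modal: text-text
    s_it = img_emb @ txt_emb.t()   # cross-modal: image-text
    in_modal = ranking_kl(s_ii, s_tt, tau)
    cross = ranking_kl(s_ii, s_it, tau) + ranking_kl(s_tt, s_it.t(), tau)
    return in_modal + cross
```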
https://arxiv.org/abs/2404.09387
Few-shot knowledge distillation recently emerged as a viable approach to harness the knowledge of large-scale pre-trained models, using limited data and computational resources. In this paper, we propose a novel few-shot feature distillation approach for vision transformers. Our approach is based on two key steps. Leveraging the fact that vision transformers have a consistent depth-wise structure, we first copy the weights from intermittent layers of existing pre-trained vision transformers (teachers) into shallower architectures (students), where the intermittence factor controls the complexity of the student transformer with respect to its teacher. Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario, aiming to recover the information processing carried out by the skipped teacher layers. We present comprehensive experiments with supervised and self-supervised transformers as teachers, on five data sets from various domains, including natural, medical and satellite images. The empirical results confirm the superiority of our approach over competitive baselines. Moreover, the ablation results demonstrate the usefulness of each component of the proposed pipeline.
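The layer-copying step can be sketched as follows, assuming the vision transformer exposes its blocks as an `nn.Sequential` named `blocks` (as in timm's ViT); the intermittence factor k then sets the student's depth.

```python
# Sketch of the layer-copying step. With a depth-12 teacher and
# intermittence factor k=2, the student keeps copies of teacher
# blocks 0, 2, 4, ... and has depth 6.
import copy
import torch.nn as nn

def build_student_from_teacher(teacher: nn.Module, k: int = 2) -> nn.Module:
    student = copy.deepcopy(teacher)
    kept = [copy.deepcopy(blk)
            for i, blk in enumerate(teacher.blocks) if i % k == 0]
    student.blocks = nn.Sequential(*kept)
    return student
```

The LoRA-based distillation stage would then train only low-rank adapters on the student to recover the processing of the skipped layers.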
https://arxiv.org/abs/2404.09326
Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance the use of low-cost sensors and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing roof height maps, including particularly difficult ones. RoofDiffusion leverages widely available curated footprints and can thus handle up to 99% point sparsity and 80% roof-area occlusion (regional incompleteness). A variant, No-FP RoofDiffusion, simultaneously predicts building footprints and heights. Both quantitatively outperform state-of-the-art unguided depth completion and representative inpainting methods for Digital Elevation Models (DEMs), on both a roof-specific benchmark and the BuildingNet dataset. Qualitative assessments show the effectiveness of RoofDiffusion on datasets with real-world scans, including AHN3, Dales3D, and USGS 3DEP LiDAR. Tested with the leading City3D algorithm, preprocessing height maps with RoofDiffusion noticeably improves 3D building reconstruction. RoofDiffusion is complemented by a new dataset of 13k complex roof geometries focusing on long-tail issues in remote sensing; a novel simulation of tree occlusion; and a wide variety of large-area roof cut-outs for data augmentation and benchmarking.
https://arxiv.org/abs/2404.09290
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where obtaining labeled data is time-consuming, error-prone, and ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of pretext task is critical, as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods yield similar downstream results, contrastive learning consistently leads to better downstream performance than the other self-supervised pre-training methods. This holds true in a limited-data downstream context.
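As a reference point for the contrastive pretext task, a minimal InfoNCE objective over two augmented views of the same audio clip could look like the following; the temperature and embedding shapes are illustrative assumptions.

```python
# Minimal InfoNCE objective of the kind used for contrastive
# pre-training, phrased over embeddings of two augmented views
# of the same audio clip.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    # z1, z2: (N, D) embeddings of two views of N clips. Matched rows
    # are positives; every other row in the batch acts as a negative.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)
```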
https://arxiv.org/abs/2404.09177
Deep clustering, an important branch of unsupervised representation learning, focuses on embedding semantically similar samples into the same feature space. This core demand has inspired explorations of contrastive learning and subspace clustering. However, these solutions rely on the basic assumption that there are sufficient, category-balanced samples for generating valid high-level representations. This assumption is too strict to hold in real-world applications. To overcome this challenge, a natural strategy is to use generative models to synthesize a considerable number of additional instances. How to use these novel samples to effectively improve clustering performance, however, remains difficult and under-explored. In this paper, we propose a novel Generative Calibration Clustering (GCC) method that delicately incorporates feature learning and augmentation into the clustering procedure. First, we develop a discriminative feature alignment mechanism to discover the intrinsic relationship across real and generated samples. Second, we design a self-supervised metric learning scheme to generate more reliable cluster assignments that boost the conditional diffusion generation. Extensive experimental results on three benchmarks validate the effectiveness and advantage of our proposed method over state-of-the-art methods.
https://arxiv.org/abs/2404.09115
Dynamic Facial Expression Recognition (DFER) has received significant interest in recent years, driven by its pivotal role in enabling empathic and human-compatible technologies. Achieving robustness to in-the-wild data in DFER is particularly important for real-world applications. One direction for improving such models is multimodal emotion recognition based on audio and video data. Multimodal learning in DFER increases the model's capabilities by leveraging richer, complementary data representations. Within the field of multimodal DFER, recent methods have focused on exploiting advances in self-supervised learning (SSL) for pre-training strong multimodal encoders. Another line of research has focused on adapting pre-trained static models for DFER. In this work, we take a different perspective on the problem and investigate advancing multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders. We identify the main challenges associated with this task, namely intra-modality adaptation, cross-modal alignment, and temporal adaptation, and propose solutions to each of them. As a result, we demonstrate improvements over the current state of the art on two popular DFER benchmarks, DFEW and MFAW.
https://arxiv.org/abs/2404.09010
Generalized Class Discovery (GCD) aims to dynamically assign labels to unlabelled data, partially based on knowledge learned from labelled data, where the unlabelled data may come from known or novel classes. The prevailing approach generally involves clustering across all data and learning conceptions by prototypical contrastive learning. However, existing methods largely hinge on the performance of clustering algorithms and are thus subject to their inherent limitations. Firstly, the estimated cluster number is often smaller than the ground truth, so existing methods suffer from a lack of prototypes for comprehensive conception learning. To address this issue, we propose an adaptive probing mechanism that introduces learnable potential prototypes to expand the cluster prototypes (centers). As there is no ground truth for the potential prototypes, we develop a self-supervised prototype learning framework to optimize them in an end-to-end fashion. Secondly, clustering is computationally intensive, and the conventional strategy of clustering both labelled and unlabelled instances exacerbates this issue. To counteract this inefficiency, we cluster only the unlabelled instances and subsequently expand the cluster prototypes with our introduced potential prototypes to quickly explore novel classes. Despite the simplicity of our proposed method, extensive empirical analysis on a wide range of datasets confirms that it consistently delivers state-of-the-art results. Specifically, our method surpasses the nearest competitor by a significant margin of 9.7% on the Stanford Cars dataset and achieves a 12× gain in clustering efficiency on the Herbarium 19 dataset. We will make the code and checkpoints publicly available at this https URL.
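A hedged sketch of how learnable potential prototypes might be appended to frozen cluster centers is shown below; the dimensions, initialization, and soft-assignment rule are our illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of adaptive probing: frozen cluster centers are
# expanded with learnable "potential" prototypes so the prototype set
# can exceed the (under-)estimated cluster count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandablePrototypes(nn.Module):
    def __init__(self, cluster_centers: torch.Tensor, n_potential: int):
        super().__init__()
        d = cluster_centers.shape[1]
        self.register_buffer("centers", F.normalize(cluster_centers, dim=1))
        self.potential = nn.Parameter(0.01 * torch.randn(n_potential, d))

    def forward(self, feats: torch.Tensor, tau: float = 0.1):
        protos = torch.cat(
            [self.centers, F.normalize(self.potential, dim=1)], dim=0)
        # Soft assignment over both estimated and potential prototypes;
        # only the potential ones receive gradients from the SSL loss.
        return F.normalize(feats, dim=1) @ protos.t() / tau
```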
https://arxiv.org/abs/2404.08995
Detecting various types of stress (nutritional, water, nitrogen, etc.) in agricultural fields is critical for farmers seeking to ensure maximum productivity. However, stresses show up in different shapes and sizes across different crop types and varieties. Hence, this is posed as an anomaly detection task on agricultural images. Accurate anomaly detection in agricultural UAV images is vital for the early identification of field irregularities. Traditional supervised learning faces challenges in adapting to diverse anomalies, necessitating extensive annotated data. In this work, we overcome this limitation with self-supervised learning using a masked image modeling approach. Masked Autoencoders (MAE) extract meaningful normal features from unlabeled image samples, which produces high reconstruction error for abnormal pixels during reconstruction. To remove the need for training on only "normal" data, we use an anomaly suppression loss mechanism that effectively minimizes the reconstruction of anomalous pixels and allows the model to learn anomalous areas without explicitly separating "normal" images for training. Evaluation on the Agriculture-Vision data challenge shows an mIoU score improvement over the prior state of the art among unsupervised and self-supervised methods. A single model generalizes across all the anomaly categories in the Agri-Vision Challenge dataset.
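One way such an anomaly suppression loss could be realized is to downweight the highest-error reconstructed pixels, on the assumption that they are likely anomalous; the following sketch reflects our reading of the idea, not the paper's exact formulation, and the top-error fraction is an assumed hyperparameter.

```python
# A sketch of anomaly suppression: masked-pixel MSE in which the
# highest-error pixels (assumed anomalous) are dropped, so "normal"
# appearance dominates what the masked autoencoder learns.
import torch

def anomaly_suppressed_loss(recon, target, mask, top_frac=0.1):
    # recon/target: (B, C, H, W); mask: (B, H, W), 1 where pixels were
    # masked out of the input and must be reconstructed.
    err = ((recon - target) ** 2).mean(dim=1) * mask
    k = max(1, int(top_frac * int(mask.sum().item())))
    thresh = err.flatten().topk(k).values.min()      # error cutoff
    weight = mask * (err < thresh).float()           # suppress top errors
    return (err * weight).sum() / weight.sum().clamp(min=1.0)
```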
https://arxiv.org/abs/2404.08931
This paper is about effectively utilizing synthetic data for training deep neural networks for industrial parts classification, in particular by taking into account the domain gap relative to real-world images. To this end, we introduce a synthetic dataset that may serve as a preliminary testbed for the Sim-to-Real challenge; it contains 17 objects from six industrial use cases, including isolated and assembled parts. A few subsets of objects exhibit large similarities in shape and albedo, reflecting challenging cases of industrial parts. All sample images come with and without random backgrounds and post-processing, for evaluating the importance of domain randomization. We call it the Synthetic Industrial Parts dataset (SIP-17). We study the usefulness of SIP-17 by benchmarking the performance of five state-of-the-art deep network models, supervised and self-supervised, trained only on the synthetic data and tested on real data. By analyzing the results, we deduce insights into the feasibility and challenges of using synthetic data for industrial parts classification and of further developing larger-scale synthetic datasets. Our dataset and code are publicly available.
https://arxiv.org/abs/2404.08778
To make sense of their surroundings, intelligent systems must transform complex sensory inputs to structured codes that are reduced to task-relevant information such as object category. Biological agents achieve this in a largely autonomous manner, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM, such as masking technique and data augmentation, influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but also to reassemble MIM in a form more in line with the focused nature of biological perception. From a theoretical angle, we find that MIM disentangles neurons in latent space, a property that has been suggested to structure visual representations in primates, without explicit regulation. Together with previous findings on invariance learning, this highlights an interesting connection of MIM to latent regularization approaches for self-supervised learning. The source code is available under this https URL
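As a baseline for the masking component analyzed here, generic random patch masking in MIM can be sketched as follows; a "focused" variant in the spirit of foveated vision would replace the random scores with a gaze-centered mask. The shapes and mask ratio are illustrative assumptions.

```python
# Bare-bones random patch masking for masked image modeling: drop a
# fraction of patches and train the model to reconstruct them.
import torch

def random_patch_mask(patches, mask_ratio=0.75):
    # patches: (B, N, D) patch embeddings. Returns the visible patches
    # and the indices of masked patches the model must reconstruct.
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    scores = torch.rand(B, N, device=patches.device)
    order = scores.argsort(dim=1)
    keep, masked = order[:, :n_keep], order[:, n_keep:]
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, masked
```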
https://arxiv.org/abs/2404.08526
We study the problem of self-supervised 3D scene flow estimation from real, large-scale raw point cloud sequences, which is crucial to tasks such as trajectory prediction and instance segmentation. In the absence of ground-truth scene flow labels, contemporary approaches concentrate on deducing the optimal flow across sequential pairs of point clouds by incorporating structure-based regularization on flow and object rigidity. The rigid objects are estimated by a variety of 3D spatial clustering methods. While state-of-the-art methods successfully capture overall scene motion using the Neural Prior structure, they encounter challenges in discerning multi-object motions. We identified the structural constraints and the use of large, strict rigid clusters as the main pitfalls of current approaches, and we propose a novel clustering approach that allows a combined representation of overlapping soft clusters and non-overlapping rigid clusters. Flow is then jointly estimated with progressively growing non-overlapping rigid clusters together with fixed-size overlapping soft clusters. We evaluate our method on multiple datasets with LiDAR point clouds, demonstrating superior performance over the self-supervised baselines and reaching new state-of-the-art results. Our method especially excels at resolving flow in complicated dynamic scenes with multiple independently moving objects close to each other, including pedestrians, cyclists, and other vulnerable road users. Our codes will be publicly available.
https://arxiv.org/abs/2404.08363
The field of Earth Observations (EO) offers a wealth of data from diverse sensors, presenting a great opportunity for advancing self-supervised multimodal learning. However, current multimodal EO datasets and models focus on a single data type, either mono-date images or time series, which limits their expressivity. We introduce OmniSat, a novel architecture that exploits the spatial alignment between multiple EO modalities to learn expressive multimodal representations without labels. To demonstrate the advantages of combining modalities of different natures, we augment two existing datasets with new modalities. As demonstrated on three downstream tasks (forestry, land cover classification, and crop mapping), OmniSat can learn rich representations in an unsupervised manner, leading to improved performance in the semi- and fully-supervised settings, even when only one modality is available for inference. The code and dataset are available at this http URL.
https://arxiv.org/abs/2404.08351