Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend slot attention to a multi-query approach, allowing the model to learn multiple sets of slots and produce more stable masks. During training, these multiple sets of slots are learned independently, while at test time they are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
https://arxiv.org/abs/2404.19654
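The test-time merging of slot sets can be sketched with SciPy's Hungarian solver. The cosine-similarity cost and the averaging merge rule below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_slot_sets(slots_a, slots_b):
    """Match two sets of slot vectors with the Hungarian algorithm
    and average the matched pairs (one simple merging rule)."""
    # Cost matrix: negative cosine similarity between every pair of slots.
    a = slots_a / np.linalg.norm(slots_a, axis=1, keepdims=True)
    b = slots_b / np.linalg.norm(slots_b, axis=1, keepdims=True)
    rows, cols = linear_sum_assignment(-(a @ b.T))
    return (slots_a[rows] + slots_b[cols]) / 2.0

rng = np.random.default_rng(0)
set_a = rng.normal(size=(4, 8))
set_b = set_a[::-1].copy() + 0.01 * rng.normal(size=(4, 8))  # permuted, noisy copy
merged = merge_slot_sets(set_a, set_b)
```

With the second set being a permuted near-copy of the first, the optimal assignment undoes the permutation, so the merged slots stay close to the originals.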
Out-of-distribution (OOD) problems in few-shot classification (FSC) occur when novel classes sampled from testing distributions differ from base classes drawn from training distributions, which considerably degrades the performance of deep learning models deployed in real-world applications. Recent studies suggest that OOD problems in FSC mainly include: (a) cross-domain few-shot classification (CD-FSC) and (b) spurious-correlation few-shot classification (SC-FSC). Specifically, CD-FSC occurs when a classifier learns to transfer knowledge from base classes drawn from seen training distributions but must recognize novel classes sampled from unseen testing distributions. In contrast, SC-FSC arises when a classifier relies on non-causal features (or contexts) that happen to be correlated with the labels (or concepts) in base classes, but such relationships no longer hold during model deployment. Although CD-FSC has been extensively studied, SC-FSC remains understudied due to the lack of corresponding evaluation benchmarks. To this end, we present Meta Concept Context (MetaCoCo), a benchmark with spurious-correlation shifts collected from real-world scenarios. Moreover, to quantify the extent of spurious-correlation shifts in the presented MetaCoCo, we further propose a metric that uses CLIP as a pre-trained vision-language model. Extensive experiments on the proposed benchmark are performed to evaluate state-of-the-art methods in FSC, cross-domain shifts, and self-supervised learning. The experimental results show that the performance of existing methods degrades significantly in the presence of spurious-correlation shifts. We open-source all code of our benchmark and hope that the proposed MetaCoCo can facilitate future research on spurious-correlation shift problems in FSC. The code is available at: this https URL.
https://arxiv.org/abs/2404.19644
In recent years, Event Sound Source Localization has been widely applied in various fields. Recent works, typically relying on the contrastive learning framework, show impressive performance. However, prior work is based on large, relatively simple datasets. It is also crucial to understand and analyze human behaviors (actions and interactions of people), voices, and sounds in chaotic events in many applications, e.g., crowd management and emergency response services. In this paper, we apply an existing model to a more complex dataset, explore the influence of parameters on the model, and propose a semi-supervised improvement method, SemiPL. With the increase in data quantity and the influence of label quality, self-supervised learning will be an unstoppable trend. The experiments show that parameter adjustment positively affects the existing model. In particular, SSPL achieved an improvement of 12.2% cIoU and 0.56% AUC on Chaotic World compared to the reported results. The code is available at: this https URL
https://arxiv.org/abs/2404.19615
In this paper, we propose a highly efficient method to estimate an image's mean opinion score (MOS) from a single opinion score (SOS). Assuming that each SOS is the observed sample of a normal distribution and the MOS is its unknown expectation, the MOS inference is formulated as a maximum likelihood estimation problem, where the perceptual correlation of pairwise images is considered in modeling the likelihood of SOS. More specifically, by means of the quality-aware representations learned from the self-supervised backbone, we introduce a learnable relative quality measure to predict the MOS difference between two images. Then, the current image's maximum likelihood estimation towards MOS is represented by the sum of another reference image's estimated MOS and their relative quality. Ideally, no matter which image is selected as the reference, the MOS of the current image should remain unchanged, which is termed perceptual constancy constrained calibration (PC3). Finally, we alternately optimize the relative quality measure's parameter and the current image's estimated MOS via backpropagation and Newton's method respectively. Experiments show that the proposed method is efficient in calibrating the biased SOS and significantly improves IQA model learning when only SOSs are available.
https://arxiv.org/abs/2404.19595
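The perceptual-constancy idea can be illustrated numerically: if the relative quality measure were exact, averaging the reference-based estimates would recover the true MOS up to a global offset anchored by the mean SOS. The single averaging pass below is a toy stand-in for the paper's alternating backpropagation/Newton optimization:

```python
import numpy as np

def calibrate_mos(sos, rel_quality):
    """Image i's MOS estimate is the average over reference images j of
    (SOS of j + predicted quality gap between i and j). Perceptual
    constancy says this estimate should not depend on the choice of j."""
    return (rel_quality + sos[None, :]).mean(axis=1)

true_mos = np.array([1.0, 2.0, 3.0, 5.0])
sos = true_mos + np.array([0.3, -0.2, 0.1, -0.2])   # biased single ratings
gap = true_mos[:, None] - true_mos[None, :]          # idealized relative quality
calibrated = calibrate_mos(sos, gap)
```

With zero-mean rating noise and exact pairwise gaps, the averaging pass recovers the true MOS values exactly.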
Matching visible and near-infrared (NIR) images remains a significant challenge in remote sensing image fusion. The nonlinear radiometric differences between heterogeneous remote sensing images make the image matching task even more difficult. Deep learning has gained substantial attention in computer vision tasks in recent years. However, many methods rely on supervised learning and necessitate large amounts of annotated data. Nevertheless, annotated data is frequently limited in the field of remote sensing image matching. To address this challenge, this paper proposes a novel keypoint descriptor approach that obtains robust feature descriptors via a self-supervised matching network. A light-weight transformer network, termed LTFormer, is designed to generate deep-level feature descriptors. Furthermore, we implement an innovative triplet loss function, LT Loss, to further enhance matching performance. Our approach outperforms conventional hand-crafted local feature descriptors and proves competitive with state-of-the-art deep learning-based methods, even amidst the shortage of annotated data.
https://arxiv.org/abs/2404.19311
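The paper's LT Loss builds on the triplet loss; a minimal sketch of the classic form is below (Euclidean distance and a margin of 1.0 are assumptions, not the paper's exact variant):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Classic triplet loss: pull the anchor toward the positive descriptor
    and push it at least `margin` farther from the negative one."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
easy = triplet_loss(a, p, np.array([5.0, 0.0]))   # negative already far: zero loss
hard = triplet_loss(a, p, np.array([0.2, 0.0]))   # negative too close: positive loss
```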
Self-supervised learning (SSL) has developed rapidly in recent years. However, most of the mainstream methods are computationally expensive and rely on two (or more) augmentations for each image to construct positive pairs. Moreover, they mainly focus on large models and large-scale datasets, which lack flexibility and feasibility in many practical applications. In this paper, we propose an efficient single-branch SSL method based on non-parametric instance discrimination, aiming to improve the algorithm, model, and data efficiency of SSL. By analyzing the gradient formula, we correct the update rule of the memory bank, improving performance. We further propose a novel self-distillation loss that minimizes the KL divergence between the probability distribution and its square root version. We show that this alleviates the infrequent updating problem in instance discrimination and greatly accelerates convergence. We systematically compare the training overhead and performance of different methods at different data scales and with different backbones. Experimental results show that our method outperforms various baselines with significantly less overhead, and is especially effective for limited amounts of data and small models.
https://arxiv.org/abs/2404.19289
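The proposed self-distillation loss can be sketched as follows; the KL direction and the renormalization of the square-root distribution are assumptions about details the abstract leaves open:

```python
import numpy as np

def sqrt_self_distillation_loss(p, eps=1e-12):
    """KL(p || q) where q is p's element-wise square root, renormalized
    to a probability distribution. Since sqrt flattens the distribution,
    the loss vanishes only when p is already uniform-like."""
    q = np.sqrt(p)
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

uniform = np.full(4, 0.25)
peaked = np.array([0.85, 0.05, 0.05, 0.05])
loss_uniform = sqrt_self_distillation_loss(uniform)   # zero: sqrt of uniform is uniform
loss_peaked = sqrt_self_distillation_loss(peaked)     # positive: sqrt flattens the peak
```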
Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot achieve sufficient disentanglement, and pitch and rhythm may still be mixed together. This risk of information overlap in the disentangling process results in less speech naturalness. To overcome such limits, we propose a two-stage model that disentangles speech representations in a self-supervised manner without a human-crafted bottleneck design, using Mutual Information (MI) with a designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage. Experiments show that our model achieves better performance than the baseline regarding disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at this https URL.
https://arxiv.org/abs/2404.19212
Understanding the severity of conditions shown in images is crucial in medical diagnosis, serving as a key guide for clinical assessment and treatment, as well as for evaluating longitudinal progression. This paper proposes ConPrO: a novel representation learning method for severity assessment in medical images using Contrastive learning-integrated Preference Optimization. Different from conventional contrastive learning methods that maximize the distance between classes, ConPrO injects into the latent vector the distance preference knowledge between various severity classes and the normal class. We systematically examine the key components of our framework to illuminate how contrastive prediction tasks acquire valuable representations. We show that our representation learning framework offers valuable severity ordering in the feature space while outperforming previous state-of-the-art methods on classification tasks. We achieve a 6% and 20% relative improvement compared to a supervised and a self-supervised baseline, respectively. In addition, we discuss severity indicators and related applications of preference comparison in the medical domain.
https://arxiv.org/abs/2404.18831
This paper introduces YOLOv8-TO, a novel approach for reverse engineering of topology-optimized structures into interpretable geometric parameters using the YOLOv8 instance segmentation model. Density-based topology optimization methods require post-processing to convert the optimal density distribution into a parametric representation for design exploration and integration with CAD tools. Traditional methods such as skeletonization struggle with complex geometries and require manual intervention. YOLOv8-TO addresses these challenges by training a custom YOLOv8 model to automatically detect and reconstruct structural components from binary density distributions. The model is trained on a diverse dataset of both optimized and random structures generated using the Moving Morphable Components method. A custom reconstruction loss function based on the dice coefficient of the predicted geometry is used to train the new regression head of the model via self-supervised learning. The method is evaluated on test sets generated from different topology optimization methods, including out-of-distribution samples, and compared against a skeletonization approach. Results show that YOLOv8-TO significantly outperforms skeletonization in reconstructing visually and structurally similar designs. The method showcases an average improvement of 13.84% in the Dice coefficient, with peak enhancements reaching 20.78%. The method demonstrates good generalization to complex geometries and fast inference times, making it suitable for integration into design workflows using regular workstations. Limitations include the sensitivity to non-max suppression thresholds. YOLOv8-TO represents a significant advancement in topology optimization post-processing, enabling efficient and accurate reverse engineering of optimized structures for design exploration and manufacturing.
https://arxiv.org/abs/2404.18763
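The reconstruction loss above is based on the Dice coefficient; a minimal binary-mask version is sketched below (the paper applies a dice-based loss to predicted geometry, so the toy masks here are purely illustrative):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2|A ∩ B| / (|A| + |B|) on binary masks; 1 - dice is a
    common overlap-based loss."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 2x2 square
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # 2x3 rectangle overlapping it
score = dice_coefficient(a, b)
```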
Similar to humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we address dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks that find parallels in human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.
https://arxiv.org/abs/2404.18739
Purpose: Paranasal anomalies, frequently identified in routine radiological screenings, exhibit diverse morphological characteristics. Due to the diversity of anomalies, supervised learning methods require large labelled datasets exhibiting diverse anomaly morphology. Self-supervised learning (SSL) can be used to learn representations from unlabelled data. However, there are no SSL methods designed for the downstream task of classifying paranasal anomalies in the maxillary sinus (MS). Methods: Our approach uses a 3D Convolutional Autoencoder (CAE) trained in an unsupervised anomaly detection (UAD) framework. Initially, we train the 3D CAE to reduce reconstruction errors when reconstructing normal maxillary sinus (MS) images. Then, this CAE is applied to an unlabelled dataset to generate coarse anomaly locations by creating residual MS images. Following this, a 3D Convolutional Neural Network (CNN) reconstructs these residual images, which forms our SSL task. Lastly, we fine-tune the encoder part of the 3D CNN on a labelled dataset of normal and anomalous MS images. Results: The proposed SSL technique exhibits superior performance compared to existing generic self-supervised methods, especially in scenarios with limited annotated data. When trained on just 10% of the annotated dataset, our method achieves an Area Under the Precision-Recall Curve (AUPRC) of 0.79 for the downstream classification task. This performance surpasses other methods, with BYOL attaining an AUPRC of 0.75, SimSiam at 0.74, SimCLR at 0.73 and Masked Autoencoding using SparK at 0.75. Conclusion: A self-supervised learning approach that inherently focuses on localizing paranasal anomalies proves to be advantageous, particularly when the subsequent task involves differentiating normal from anomalous maxillary sinuses. Access our code at this https URL
https://arxiv.org/abs/2404.18599
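The coarse-anomaly step described above amounts to residual thresholding: an autoencoder trained only on normal scans reconstructs normal anatomy well, so large residuals flag anomalies. The threshold value and the toy "scan" below are illustrative assumptions:

```python
import numpy as np

def residual_anomaly_map(image, reconstruction, threshold=0.1):
    """Absolute reconstruction residual plus a binary mask of pixels
    whose residual exceeds the threshold (coarse anomaly locations)."""
    residual = np.abs(image - reconstruction)
    return residual, residual > threshold

scan = np.zeros((6, 6)); scan[2:4, 2:4] = 1.0   # bright blob the model never saw
recon = np.zeros((6, 6))                         # reconstruction of normal tissue
residual, mask = residual_anomaly_map(scan, recon)
```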
This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. By utilizing a pre-trained masked autoencoder model, MultiMAE-DER is accomplished through simple, straightforward fine-tuning. The performance of MultiMAE-DER is enhanced by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on CREMA-D. Furthermore, when compared with the state-of-the-art model of multimodal self-supervised learning, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.
https://arxiv.org/abs/2404.18327
Existing X-ray based pre-trained vision models are usually trained on a relatively small-scale dataset (less than 500k samples) with limited resolution (e.g., 224 $\times$ 224). However, the key to the success of self-supervised pre-training of large models lies in massive training data, and maintaining high resolution in the field of X-ray imaging is essential for effective solutions to difficult, miscellaneous diseases. In this paper, we address these issues by proposing the first high-definition (1280 $\times$ 1280) X-ray based pre-trained foundation vision model, trained on our newly collected large-scale dataset containing more than 1 million X-ray images. Our model follows the masked autoencoder framework, which takes as input the tokens remaining after masking (at a high rate), and the masked image patches are reconstructed by the Transformer encoder-decoder network. More importantly, we introduce a novel context-aware masking strategy that utilizes the chest contour as a boundary for adaptive masking operations. We validate the effectiveness of our model on two downstream tasks, including X-ray report generation and disease recognition. Extensive experiments demonstrate that our pre-trained medical foundation vision model achieves comparable or even new state-of-the-art performance on downstream benchmark datasets. The source code and pre-trained models of this paper will be released on this https URL.
https://arxiv.org/abs/2404.17926
Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don't accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, therefore reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med, BiomedGPT, etc., achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study emphasizes the significant advancements towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.
https://arxiv.org/abs/2404.17912
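The self-refining term above rewards agreement between pooled image features and the contextual representation of the generated report. Treating it as one minus cosine similarity is an assumption about the exact form; a minimal sketch:

```python
import numpy as np

def refinement_loss(image_pooled, text_pooled):
    """Loss = 1 - cosine similarity between the pooled image
    representation and the pooled generated-text representation."""
    a = image_pooled / np.linalg.norm(image_pooled)
    b = text_pooled / np.linalg.norm(text_pooled)
    return 1.0 - float(a @ b)

img = np.array([1.0, 0.0, 1.0])
aligned = refinement_loss(img, np.array([2.0, 0.0, 2.0]))     # same direction: zero loss
misaligned = refinement_loss(img, np.array([0.0, 1.0, 0.0]))  # orthogonal: loss of 1
```

In training, this term would be added to the standard causal language modeling objective so gradients flow into both branches.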
Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model dialogue context semantic dependencies and employ Graph Neural Networks (GNN) to capture multimodal semantic features for emotion recognition. However, these methods are limited by some inherent characteristics of GNN, such as over-smoothing and low-pass filtering, resulting in the inability to learn long-distance consistency information and complementary information efficiently. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits the problem of multimodal emotion recognition in conversation from the perspective of the graph spectrum. Specifically, we propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework GS-MCC. First, GS-MCC uses a sliding window to construct a multimodal interaction graph to model conversational relationships and uses efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information, respectively. Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementarity and consistent semantic collaboration with high and low-frequency signals, thereby improving the ability of high and low-frequency information to reflect real emotions. Finally, GS-MCC inputs the collaborative high and low-frequency information into the MLP network and softmax function for emotion prediction. Extensive experiments have proven the superiority of the GS-MCC architecture proposed in this paper on two benchmark data sets.
https://arxiv.org/abs/2404.17862
Unpaired image dehazing (UID) holds significant research importance due to the challenges in acquiring haze/clear image pairs with identical backgrounds. This paper proposes a novel method for UID named Orthogonal Decoupling Contrastive Regularization (ODCR). Our method is grounded in the assumption that an image consists of both haze-related features, which influence the degree of haze, and haze-unrelated features, such as texture and semantic information. ODCR aims to ensure that the haze-related features of the dehazing result closely resemble those of the clear image, while the haze-unrelated features align with the input hazy image. To this end, we propose Orthogonal MLPs optimized geometrically on the Stiefel manifold, which can project image features into an orthogonal space, thereby reducing the relevance between different features. Furthermore, a task-driven Depth-wise Feature Classifier (DWFC) is proposed, which assigns weights to the orthogonal features based on the contribution of each channel's feature in predicting, in a self-supervised fashion, whether the feature source is hazy or clear. Finally, a Weighted PatchNCE (WPNCE) loss is introduced to pull the haze-related features in the output image toward those of clear images, while bringing haze-unrelated features close to those of the hazy input. Extensive experiments demonstrate the superior performance of our ODCR method on UID.
https://arxiv.org/abs/2404.17825
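A point on the Stiefel manifold is simply a matrix with orthonormal columns. The sketch below obtains one via QR decomposition and uses it to project features, rather than performing the paper's geometric optimization of MLP weights on the manifold:

```python
import numpy as np

def stiefel_point(rows, cols, seed=0):
    """A matrix with orthonormal columns, i.e. a point on the Stiefel
    manifold St(rows, cols), obtained by QR of a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(rows, cols)))
    return q

W = stiefel_point(8, 4)
features = np.ones((2, 8))
projected = features @ W   # features mapped through an orthonormal basis
```

The defining property W^T W = I is what the manifold constraint enforces during optimization.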
We propose a novel point-based representation, Gaussian surfels, to combine the advantages of the flexible optimization procedure in 3D Gaussian points and the surface alignment property of surfels. This is achieved by directly setting the z-scale of 3D Gaussian points to 0, effectively flattening the original 3D ellipsoid into a 2D ellipse. Such a design provides clear guidance to the optimizer. By treating the local z-axis as the normal direction, it greatly improves optimization stability and surface alignment. While the derivatives to the local z-axis computed from the covariance matrix are zero in this setting, we design a self-supervised normal-depth consistency loss to remedy this issue. Monocular normal priors and foreground masks are incorporated to enhance the quality of the reconstruction, mitigating issues related to highlights and background. We propose a volumetric cutting method to aggregate the information of Gaussian surfels so as to remove erroneous points in depth maps generated by alpha blending. Finally, we apply screened Poisson reconstruction method to the fused depth maps to extract the surface mesh. Experimental results show that our method demonstrates superior performance in surface reconstruction compared to state-of-the-art neural volume rendering and point-based rendering methods.
https://arxiv.org/abs/2404.17774
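The normal-depth consistency idea can be sketched by deriving normals from a depth map with finite differences and comparing them against rendered normals. The orthographic-camera assumption and the 1-minus-cosine loss form are simplifications of the paper's setup:

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel normals from a depth map via finite differences,
    assuming an orthographic camera (a simplification)."""
    dzdy, dzdx = np.gradient(depth)
    n = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_depth_consistency(rendered_normals, depth):
    """Mean (1 - cos angle) between rendered normals and normals
    derived from the rendered depth."""
    cos = (rendered_normals * normals_from_depth(depth)).sum(axis=-1)
    return float(np.mean(1.0 - cos))

flat_depth = np.full((5, 5), 2.0)
flat_normals = np.zeros((5, 5, 3)); flat_normals[..., 2] = 1.0
loss_flat = normal_depth_consistency(flat_normals, flat_depth)     # consistent: zero
ramp_depth = np.tile(np.arange(5.0), (5, 1))
loss_tilted = normal_depth_consistency(flat_normals, ramp_depth)   # inconsistent: positive
```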
In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. Our approach leverages hyperbolic embeddings and the concept of entailment cones to evaluate and filter out samples with meaningless or underspecified semantics, focusing on enhancing the specificity of each data sample. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark when combined with existing filtering techniques. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models. Additionally, the image specificity $\epsilon_{i}$ can be independently applied to induce an image-only dataset from an image-text or image-only data pool for training image-only self-supervised models, and the induced dataset showed superior performance compared to the dataset induced by the CLIP score.
https://arxiv.org/abs/2404.17507
We propose a self-supervised approach for learning physics-based subspaces for real-time simulation. Existing learning-based methods construct subspaces by approximating pre-defined simulation data in a purely geometric way. However, this approach tends to produce high-energy configurations, leads to entangled latent space dimensions, and generalizes poorly beyond the training set. To overcome these limitations, we propose a self-supervised approach that directly minimizes the system's mechanical energy during training. We show that our method leads to learned subspaces that reflect physical equilibrium constraints, resolve overfitting issues of previous methods, and offer interpretable latent space parameters.
https://arxiv.org/abs/2404.17620
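The core idea, descending on the system's mechanical energy itself rather than fitting precomputed simulation data, can be shown on a toy 1D spring chain. Finite-difference gradients stand in for backpropagation through a learned subspace decoder:

```python
import numpy as np

def spring_energy(x, rest=1.0, k=10.0):
    """Mechanical energy of a 1D chain of springs with unit rest length."""
    stretch = np.diff(x) - rest
    return 0.5 * k * np.sum(stretch ** 2)

# Self-supervised signal: minimize the energy directly, no reference data.
x = np.array([0.0, 0.5, 2.5, 2.8])  # an initially deformed configuration
for _ in range(500):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = 1e-5
        grad[i] = (spring_energy(x + e) - spring_energy(x - e)) / 2e-5
    x = x - 0.01 * grad
```

Gradient descent drives the chain to an equilibrium with all springs at rest length, which is the physical-equilibrium property the learned subspaces are meant to reflect.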
Transferring the weights of a pre-trained model to assist another task has become a crucial part of modern deep learning, particularly in data-scarce scenarios. Pre-training refers to the initial step of training models outside the current task of interest, typically on another dataset. It can be done via supervised models using human-annotated datasets or self-supervised models trained on unlabeled datasets. In both cases, many pre-trained models are available to fine-tune for the task of interest. Interestingly, research has shown that pre-trained models from ImageNet can be helpful for audio tasks despite being trained on image datasets. Hence, it is unclear whether in-domain models would be advantageous compared to competent out-domain models, such as convolutional neural networks from ImageNet. Our experiments demonstrate the usefulness of in-domain models and datasets for bird species recognition by leveraging VICReg, a recent and powerful self-supervised method.
https://arxiv.org/abs/2404.17252
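VICReg's objective combines three terms: invariance (MSE between two views), variance (a hinge keeping each embedding dimension's standard deviation above a target), and covariance (penalizing off-diagonal covariance). A minimal NumPy sketch, with the commonly used weights (25/25/1) as assumptions:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """VICReg loss on two batches of embeddings (Bardes et al. style)."""
    n, d = z1.shape
    inv = np.mean((z1 - z2) ** 2)                      # invariance term
    def var_term(z):                                    # keep per-dim std above gamma
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))
    def cov_term(z):                                    # decorrelate dimensions
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d
    var = 0.5 * (var_term(z1) + var_term(z2))
    cov = 0.5 * (cov_term(z1) + cov_term(z2))
    return sim_w * inv + var_w * var + cov_w * cov

# Identical, well-spread, decorrelated embeddings incur (near-)zero loss.
z = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
loss_good = vicreg_loss(z, z.copy())
```

A collapsed batch (all-zero embeddings) is heavily penalized by the variance term, which is exactly the representation collapse VICReg is designed to prevent.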