Computed tomography (CT) is a widely used non-invasive medical imaging technique for disease diagnosis. The diagnostic accuracy is often affected by image resolution, which can be insufficient in practice. For medical CT images, the through-plane resolution is often worse than the in-plane resolution, and there can be overlap between slices, causing difficulties in diagnosis. Self-supervised methods for through-plane resolution enhancement, which train on in-plane images and infer on through-plane images, have shown promise for both CT and MRI. However, existing self-supervised methods either neglect overlap or can only handle specific cases with fixed combinations of resolution and overlap. To address these limitations, we propose a self-supervised method called SR4ZCT. It employs the same off-axis training approach while being capable of handling arbitrary combinations of resolution and overlap. Our method explicitly models the relationship between the resolutions and voxel spacings of different planes to accurately simulate training images that match the original through-plane images. We highlight the significance of accurate modeling in self-supervised off-axis training and demonstrate the effectiveness of SR4ZCT using a real-world dataset.
https://arxiv.org/abs/2405.02515
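The resolution/overlap modeling that SR4ZCT relies on can be illustrated with a minimal 1D sketch (my own illustration, not the paper's implementation): thick, possibly overlapping through-plane slices are simulated from finely sampled data by averaging over a slice thickness, with a slice spacing smaller than the thickness producing overlap.

```python
def simulate_thick_slices(signal, thickness, spacing):
    """Simulate thick, possibly overlapping slices from a finely sampled
    1D signal: each slice averages `thickness` consecutive samples, and
    consecutive slices start `spacing` samples apart. spacing < thickness
    yields overlapping slices, as in many clinical CT acquisitions."""
    slices = []
    start = 0
    while start + thickness <= len(signal):
        window = signal[start:start + thickness]
        slices.append(sum(window) / thickness)
        start += spacing
    return slices
```

With spacing equal to thickness the slices tile the axis; with smaller spacing they overlap, which is exactly the case the paper argues must be modeled when simulating training data.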
In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphones: (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. In particular, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. Firstly, considering the popularity of multiple cameras in modern smartphones, the more zoomed (telephoto) image can be naturally leveraged as the reference to guide the super-resolution (SR) of the lesser zoomed (ultra-wide) image, which gives us a chance to learn a deep network that performs SR from the dual zoomed observations (DZSR). Secondly, for self-supervised learning of DZSR, we take the telephoto image instead of an additional high-resolution image as the supervision information, and select a center patch from it as the reference to super-resolve the corresponding ultra-wide image patch. To mitigate the effect of the misalignment between the ultra-wide low-resolution (LR) patch and the telephoto ground-truth (GT) image during training, we first adopt patch-based optical flow alignment and then design an auxiliary LR to guide the deforming of the warped LR features. To generate visually pleasing results, we present a local overlapped sliced Wasserstein loss to better represent the perceptual difference between GT and output in the feature space. During testing, DZSR can be directly deployed to super-resolve the whole ultra-wide image with the reference of the telephoto image. In addition, we further take multiple zoomed observations to explore self-supervised RefSR, and present a progressive fusion scheme for the effective utilization of reference images. Experiments show that our methods achieve better quantitative and qualitative performance against state-of-the-art methods. Codes are available at this https URL.
https://arxiv.org/abs/2405.02171
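The dual-zoom setup can be made concrete with a toy geometry calculation (an illustrative simplification assuming a shared optical center and no parallax, which is precisely the misalignment the paper's alignment modules exist to handle):

```python
def telephoto_crop_box(uw_width, uw_height, zoom_ratio):
    """Region of the ultra-wide frame covered by the telephoto view,
    assuming both cameras share an optical center and the telephoto
    field of view is 1/zoom_ratio of the ultra-wide in each dimension.
    During training, the center patch of this region can serve as the
    reference for the corresponding ultra-wide patch."""
    crop_w = uw_width / zoom_ratio
    crop_h = uw_height / zoom_ratio
    left = (uw_width - crop_w) / 2
    top = (uw_height - crop_h) / 2
    return (left, top, left + crop_w, top + crop_h)
```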
In this paper, we present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (wav2vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model, and a frame-level phoneme classifier trained on forced-alignment labels (obtained with the Montreal Forced Aligner) to produce multi-lingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work, but the design of the system makes it easily adaptable to other languages.
https://arxiv.org/abs/2405.02124
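A frame-level phoneme classifier's output can be turned into phone-to-audio alignments by merging consecutive identical predictions. The following sketch (my own simplification, using integer millisecond timestamps; the real system operates on classifier posteriors) illustrates the idea:

```python
def frames_to_segments(frame_labels, frame_dur_ms):
    """Collapse per-frame phoneme predictions into (phone, start, end)
    segments by merging runs of identical labels; frame_dur_ms is the
    hop size in milliseconds (e.g. 20 for wav2vec2-style 20 ms frames)."""
    segments = []
    for i, label in enumerate(frame_labels):
        if segments and segments[-1][0] == label:
            # extend the current segment to cover this frame too
            phone, start, _ = segments[-1]
            segments[-1] = (phone, start, (i + 1) * frame_dur_ms)
        else:
            segments.append((label, i * frame_dur_ms, (i + 1) * frame_dur_ms))
    return segments
```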
This paper presents a novel self-supervised two-frame multi-camera metric depth estimation network, termed M${^2}$Depth, which is designed to predict reliable scale-aware surrounding depth in autonomous driving. Unlike previous works that use multi-view images from a single time-step or multiple time-step images from a single camera, M${^2}$Depth takes temporally adjacent two-frame images from multiple cameras as inputs and produces high-quality surrounding depth. We first construct cost volumes in the spatial and temporal domains individually and propose a spatial-temporal fusion module that integrates the spatial-temporal information to yield a strong volume representation. We additionally combine the neural prior from SAM features with internal features to reduce the ambiguity between foreground and background and strengthen the depth edges. Extensive experimental results on the nuScenes and DDAD benchmarks show M${^2}$Depth achieves state-of-the-art performance. More results can be found in this https URL .
https://arxiv.org/abs/2405.02004
Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Codes and samples are publicly available.
https://arxiv.org/abs/2405.01730
Self-supervised learning (SSL) has emerged as a key technique for training networks that can generalize well to diverse tasks without task-specific supervision. This property makes SSL desirable for computational pathology, the study of digitized images of tissues, as there are many target applications and often limited labeled training samples. However, SSL algorithms and models have been primarily developed in the field of natural images and whether their performance can be improved by adaptation to particular domains remains an open question. In this work, we present an investigation of modifications to SSL for pathology data, specifically focusing on the DINOv2 algorithm. We propose alternative augmentations, regularization functions, and position encodings motivated by the characteristics of pathology images. We evaluate the impact of these changes on several benchmarks to demonstrate the value of tailored approaches.
https://arxiv.org/abs/2405.01688
Satellite image time series (SITS) segmentation is crucial for many applications like environmental monitoring, land cover mapping and agricultural crop type classification. However, training models for SITS segmentation remains a challenging task due to the lack of abundant training data, which requires fine-grained annotation. We propose S4, a new self-supervised pre-training approach that significantly reduces the requirement for labeled training data by utilizing two new insights: (a) satellites capture images in different parts of the spectrum, such as radio frequencies and visible frequencies; (b) satellite imagery is geo-registered, allowing for fine-grained spatial alignment. We use these insights to formulate pre-training tasks in S4. We also curate m2s2-SITS, a large-scale dataset of unlabeled, spatially-aligned, multi-modal and geographic-specific SITS that serves as representative pre-training data for S4. Finally, we evaluate S4 on multiple SITS segmentation datasets and demonstrate its efficacy against competing baselines while using limited labeled data.
https://arxiv.org/abs/2405.01656
AI foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We present RayDINO, a large visual encoder trained by self-supervision on 873k chest X-rays. We compare RayDINO to previous state-of-the-art models across nine radiology tasks, from classification and dense segmentation to text generation, and provide an in-depth analysis of the population, age and sex biases of our model. Our findings suggest that self-supervision enables patient-centric AI that proves useful in clinical workflows and interprets X-rays holistically. With RayDINO and small task-specific adapters, we reach state-of-the-art results and improve generalization to unseen populations while mitigating bias, illustrating the true promise of foundation models: versatility and robustness.
https://arxiv.org/abs/2405.01469
Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, with no failures during tracking. To achieve that, one needs to efficiently tackle challenges such as device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, as well as the continuous movement due to cardiac and respiratory motion. To overcome the aforementioned challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and, in particular, robustness compared to highly optimized reference solutions (that use multi-stage feature fusion, multi-task learning and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), achieving a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
https://arxiv.org/abs/2405.01156
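The frame-interpolation-based reconstruction idea can be sketched in a toy form (frames reduced to flat lists of values; the actual method masks image tokens and learns the reconstruction with a network): a masked middle frame is reconstructed from its temporal neighbours, so matching this target forces the model to learn inter-frame correspondences.

```python
def interpolation_target(prev_frame, next_frame, alpha=0.5):
    """Reconstruction target for a masked middle frame: the linear
    interpolation of its temporal neighbours, with alpha the relative
    temporal position of the masked frame between them."""
    return [(1 - alpha) * p + alpha * n
            for p, n in zip(prev_frame, next_frame)]
```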
Self-supervised learning for image denoising in the presence of denatured noisy data is a crucial approach in machine learning. However, theoretical understanding of the performance of approaches that use denatured data is lacking. To provide a better understanding, in this paper we analyze in depth a self-supervised denoising algorithm that uses denatured data, through theoretical analysis and numerical experiments. Through the theoretical analysis, we show that the algorithm finds desired solutions to the optimization problem with the population risk, while the guarantee for the empirical risk depends on the hardness of the denoising task in terms of denaturation levels. We also conduct several experiments to investigate the performance of an extended algorithm in practice. The results indicate that training the algorithm with denatured images works, and the empirical performance aligns with the theoretical results. These results suggest several insights for the further improvement of self-supervised image denoising that uses denatured data.
https://arxiv.org/abs/2405.01124
The goal of generality in machine learning is to achieve excellent performance on various unseen tasks and domains. Recently, self-supervised learning (SSL) has been regarded as an effective method to achieve this goal. It can learn high-quality representations from unlabeled data and achieve promising empirical performance on multiple downstream tasks. Existing SSL methods mainly constrain generality from two aspects: (i) large-scale training data, and (ii) learning task-level shared knowledge. However, these methods lack explicit modeling of the SSL generality in the learning objective, and the theoretical understanding of SSL's generality remains limited. This may cause SSL models to overfit in data-scarce situations and generalize poorly in the real world, making it difficult to achieve true generality. To address these issues, we provide a theoretical definition of generality in SSL and define a $\sigma$-measurement to help quantify it. Based on this insight, we explicitly model generality into self-supervised learning and further propose a novel SSL framework, called GeSSL. It introduces a self-motivated target based on $\sigma$-measurement, which enables the model to find the optimal update direction towards generality. Extensive theoretical and empirical evaluations demonstrate the superior performance of the proposed GeSSL.
https://arxiv.org/abs/2405.01053
Despite the remarkable success of Large Language Models (LLMs) in text understanding and generation, their potential for text clustering tasks remains underexplored. We observed that powerful closed-source LLMs provide good quality clusterings of entity sets but are not scalable due to the massive compute power required and the associated costs. Thus, we propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS), a systematic approach that leverages open-source LLMs for efficient and effective supervised clustering of entity subsets, particularly focusing on text-based entities. Existing text clustering methods fail to effectively capture the context provided by the entity subset. Moreover, though there are several language modeling based approaches for clustering, very few are designed for the task of supervised clustering. This paper introduces a novel approach towards clustering entity subsets using LLMs by capturing context via a scalable inter-entity attention mechanism. We propose a novel augmented triplet loss function tailored for supervised clustering, which addresses the inherent challenges of directly applying the triplet loss to this problem. Furthermore, we introduce a self-supervised clustering task based on text augmentation techniques to improve the generalization of our model. For evaluation, we collect ground truth clusterings from a closed-source LLM and transfer this knowledge to an open-source LLM under the supervised clustering framework, allowing a faster and cheaper open-source model to perform the same task. Experiments on various e-commerce query and product clustering datasets demonstrate that our proposed approach significantly outperforms existing unsupervised and supervised baselines under various external clustering evaluation metrics.
https://arxiv.org/abs/2405.00988
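For reference, the basic triplet loss that CACTUS's augmented variant builds on looks like this (the paper's augmentation itself is not reproduced here; squared Euclidean distance is one common choice):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the anchor towards the positive and
    push it away from the negative by at least `margin`. Inputs are
    embedding vectors represented as plain lists of floats."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive; the paper's contribution is precisely in adapting this form to the supervised clustering setting.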
Background and Purpose: Identifying the thromboembolism source in ischemic stroke is crucial for treatment and secondary prevention, yet it is often undetermined. This study describes a self-supervised deep learning approach in digital pathology of emboli for classifying ischemic stroke clot origin from histopathological images. Methods: The dataset included whole slide images (WSI) from the STRIP AI Kaggle challenge, consisting of clots retrieved from ischemic stroke patients following mechanical thrombectomy. Transformer-based deep learning models were developed using transfer learning and self-supervised pretraining for classifying WSI. Customizations included an attention pooling layer, a weighted loss function, and threshold optimization. Various model architectures were tested and compared, and model performance was primarily evaluated using weighted logarithmic loss. Results: The model achieved a logloss score of 0.662 in cross-validation and 0.659 on the test set. Different model backbones were compared, with swin_large_patch4_window12_384 showing higher performance. Thresholding techniques for clot origin classification were employed to balance false positives and negatives. Conclusion: The study demonstrates the efficacy of transformer-based deep learning models in identifying ischemic stroke clot origins from histopathological images and emphasizes the need for refined modeling techniques specifically adapted to thrombi WSI. Further research is needed to improve model performance and interpretability, and to validate its effectiveness. Future enhancements could include integrating larger patient cohorts, advanced preprocessing strategies, and exploring ensemble multimodal methods for enhanced diagnostic accuracy.
https://arxiv.org/abs/2405.00908
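A weighted logarithmic loss of the kind used for evaluation can be sketched as follows (a per-class mean negative log-probability, averaged with class weights; the challenge's exact weights are not reproduced, so the weights in the example are placeholders):

```python
import math

def weighted_log_loss(y_true, y_prob, class_weights, eps=1e-15):
    """Weighted multiclass log loss: the mean negative log-probability
    of the true class is computed per class, then the per-class means
    are averaged using class_weights. y_prob holds, for each sample,
    a mapping from class label to predicted probability."""
    per_class = {}
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        per_class.setdefault(label, []).append(-math.log(p))
    total_w = sum(class_weights[c] for c in per_class)
    return sum(class_weights[c] * sum(losses) / len(losses)
               for c, losses in per_class.items()) / total_w
```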
Image Quality Assessment (IQA) is essential in various computer vision tasks such as image deblurring and super-resolution. However, most IQA methods require reference images, which are not always available. While there are some reference-free IQA metrics, they have limitations in simulating human perception and discerning subtle image quality variations. We hypothesize that the JPEG quality factor is representative of image quality measurement, and that a well-trained neural network can learn to accurately evaluate image quality without requiring a clean reference, as it can recognize image degradation artifacts based on prior knowledge. Thus, we developed a reference-free quality evaluation network, dubbed "Quality Factor (QF) Predictor", which does not require any reference. Our QF Predictor is a lightweight, fully convolutional network comprising seven layers. The model is trained in a self-supervised manner: it receives a JPEG-compressed image patch with a random QF as input and is trained to accurately predict the corresponding QF. We demonstrate the versatility of the model by applying it to various tasks. First, our QF Predictor can generalize to measure the severity of various image artifacts, such as Gaussian blur and Gaussian noise. Second, we show that the QF Predictor can be trained to predict the undersampling rate of images reconstructed from Magnetic Resonance Imaging (MRI) data.
https://arxiv.org/abs/2405.02208
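The self-supervised pair generation is easy to sketch. In the sketch below, real JPEG compression is replaced by a stand-in uniform quantizer (an assumption of mine, to keep the example dependency-free), but the labeling scheme matches the description: degrade a clean patch with a random quality factor and use that factor as the training label, so no human annotation is needed.

```python
import random

def make_training_pair(patch, rng):
    """Self-supervised pair generation in the spirit of the QF Predictor:
    degrade a clean patch (a flat list of pixel values here) with a random
    quality factor and return (degraded_patch, qf). The stand-in degradation
    is a uniform quantizer whose step grows as QF drops; a real pipeline
    would apply actual JPEG compression at that QF."""
    qf = rng.randint(10, 95)          # random quality factor
    step = (100 - qf) / 10 + 1        # lower QF -> coarser quantization
    degraded = [round(v / step) * step for v in patch]
    return degraded, qf               # (network input, self-supervised label)
```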
Voice conversion is the task of transforming the voice characteristics of source speech while preserving its content. Nowadays, self-supervised representation learning models are increasingly utilized for content extraction. However, these representations retain a lot of hidden speaker information, which leads to timbre leakage, while the prosodic information of the hidden units goes unused. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on the soft speech units with knowledge distillation. Experimental results show that the intelligibility and naturalness of the converted speech outperform those of previous work.
https://arxiv.org/abs/2405.00603
In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but it is costly to annotate them. Therefore, methods utilizing unlabeled data are researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches: Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent a model collapse in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first studies exploring self-supervised pre-training in document text recognition, and we believe it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at this https URL.
https://arxiv.org/abs/2405.00420
This paper investigates the effectiveness of self-supervised pre-trained transformers compared to supervised pre-trained transformers and conventional neural networks (ConvNets) for detecting various types of deepfakes. We focus on their potential for improved generalization, particularly when training data is limited. Despite the notable success of large vision-language models utilizing transformer architectures in various tasks, including zero-shot and few-shot learning, the deepfake detection community has still shown some reluctance to adopt pre-trained vision transformers (ViTs), especially large ones, as feature extractors. One concern is their perceived excessive capacity, which often demands extensive data, and the resulting suboptimal generalization when training or fine-tuning data is small or less diverse. This contrasts poorly with ConvNets, which have already established themselves as robust feature extractors. Additionally, training and optimizing transformers from scratch requires significant computational resources, making this accessible primarily to large companies and hindering broader investigation within the academic community. Recent advancements in using self-supervised learning (SSL) in transformers, such as DINO and its derivatives, have showcased significant adaptability across diverse vision tasks and possess explicit semantic segmentation capabilities. By leveraging DINO for deepfake detection with modest training data and implementing partial fine-tuning, we observe comparable adaptability to the task and the natural explainability of the detection result via the attention mechanism. Moreover, partial fine-tuning of transformers for deepfake detection offers a more resource-efficient alternative, requiring significantly fewer computational resources.
https://arxiv.org/abs/2405.00355
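Partial fine-tuning as described above amounts to freezing most of the pre-trained transformer and training only a small subset of parameters. A minimal sketch (the parameter names and prefixes below are hypothetical examples, not the paper's actual configuration):

```python
def partial_finetune_mask(param_names, trainable_prefixes):
    """Partial fine-tuning: mark only parameters whose names start with
    one of trainable_prefixes (e.g. the last transformer block and the
    classification head) as trainable; everything else stays frozen."""
    return {name: any(name.startswith(p) for p in trainable_prefixes)
            for name in param_names}
```

In a real training loop this mask would drive `requires_grad` flags, which is where the resource savings over full fine-tuning come from.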
This paper considers a Min-Max Multiple Traveling Salesman Problem (MTSP), where the goal is to find a set of tours, one for each agent, to collectively visit all the cities while minimizing the length of the longest tour. Though MTSP has been widely studied, obtaining near-optimal solutions for large-scale problems is still challenging due to its NP-hardness. Recent efforts in data-driven methods face challenges of the need for hard-to-obtain supervision and issues with high variance in gradient estimations, leading to slow convergence and highly suboptimal solutions. We address these issues by reformulating MTSP as a bilevel optimization problem, using the concept of imperative learning (IL). This involves introducing an allocation network that decomposes the MTSP into multiple single-agent traveling salesman problems (TSPs). The longest tour from these TSP solutions is then used to self-supervise the allocation network, resulting in a new self-supervised, bilevel, end-to-end learning framework, which we refer to as imperative MTSP (iMTSP). Additionally, to tackle the high-variance gradient issues during the optimization, we introduce a control variate-based gradient estimation algorithm. Our experiments showed that these innovative designs enable our gradient estimator to converge 20% faster than the advanced reinforcement learning baseline and find up to 80% shorter tour length compared with Google OR-Tools MTSP solver, especially in large-scale problems (e.g. 1000 cities and 15 agents).
https://arxiv.org/abs/2405.00285
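The bilevel structure of iMTSP can be sketched at the objective level: given an allocation of cities to agents, solve each agent's single-agent TSP and take the longest tour as the self-supervision signal for the allocation network. The sketch below solves the lower level by brute force, so it only runs on tiny instances (the paper pairs a learned allocation network with TSP solvers at realistic scale):

```python
from itertools import permutations

def tour_length(tour, dist):
    """Length of a closed tour that starts and ends at depot city 0."""
    path = [0] + list(tour) + [0]
    return sum(dist[a][b] for a, b in zip(path, path[1:]))

def min_max_objective(allocation, dist):
    """Upper-level objective of the bilevel formulation: for each agent's
    allocated city set, find its optimal TSP tour (brute force here), then
    return the length of the longest tour across agents."""
    longest = 0.0
    for cities in allocation:
        best = min(tour_length(p, dist) for p in permutations(cities))
        longest = max(longest, best)
    return longest
```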
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general audio, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.43 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs, even at significantly lower bitrates. Our code and demos are available at this https URL.
https://arxiv.org/abs/2405.00233
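The semantic branch's k-means discretization, and the bitrate arithmetic behind token-based codecs generally, can be sketched as follows (toy feature vectors and an illustrative vocabulary size, not the paper's configuration; the paper's 0.31-1.43 kbps figures cover its full token streams):

```python
import math

def tokenize(features, centroids):
    """Semantic tokenization sketch: assign each frame feature vector to
    its nearest k-means centroid index, as SemantiCodec's semantic branch
    does with AudioMAE features."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(centroids)), key=lambda k: sq_dist(f, centroids[k]))
            for f in features]

def bitrate_kbps(tokens_per_second, vocab_size):
    """Raw bitrate of a token stream: tokens per second times the number
    of bits needed per token for the given vocabulary size."""
    return tokens_per_second * math.log2(vocab_size) / 1000
```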
Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
https://arxiv.org/abs/2404.19654
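Merging the multiple slot sets at test time can be sketched with a brute-force stand-in for Hungarian matching (practical only for the small slot counts involved; toy 2-D vectors are used below in place of real slot representations):

```python
from itertools import permutations

def match_slot_sets(set_a, set_b):
    """Merge two learned slot sets by finding the one-to-one pairing that
    minimizes total squared distance, then averaging matched slots.
    Brute force over permutations stands in for the Hungarian algorithm,
    so this is only feasible for a handful of slots."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    best = min(permutations(range(len(set_b))),
               key=lambda perm: sum(sq_dist(set_a[i], set_b[j])
                                    for i, j in enumerate(perm)))
    return [[(a + b) / 2 for a, b in zip(set_a[i], set_b[j])]
            for i, j in enumerate(best)]
```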