Recently, the use of self-supervised learning (SSL) models for speech in downstream tasks has been drawing considerable attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions about the optimal fine-tuning strategy remain open. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task using the MSP-Podcast corpus. More specifically, we perform a series of experiments focusing on the use of gender and semantic information from utterances. We then summarize our findings and describe the final model we submitted to the Speech Emotion Recognition Challenge 2024.
https://arxiv.org/abs/2405.04485
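As a concrete illustration of this kind of setup, here is a minimal sketch (not the authors' exact architecture) of fine-tuning a WavLM Large encoder for emotion classification with a mean-pooling head, assuming the HuggingFace `transformers` API; the gender and semantic conditioning studied in the paper is omitted, and the class count is illustrative.

```python
# Minimal sketch: mean-pooled WavLM features feeding an emotion classifier.
import torch
import torch.nn as nn
from transformers import WavLMModel

class SERHead(nn.Module):
    def __init__(self, num_emotions: int = 8, hidden: int = 1024):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained("microsoft/wavlm-large")
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, num_emotions)
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.encoder(waveform).last_hidden_state  # (B, T, 1024)
        pooled = feats.mean(dim=1)                        # temporal mean pooling
        return self.classifier(pooled)                    # emotion logits
```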
The sensing and positioning capabilities foreseen in 6G hold great potential for technological advances in various domains, such as future smart cities and industrial use cases. Channel charting has emerged in recent years as a promising technology for radio-frequency-based sensing and localization. However, the accuracy of these techniques still falls far short of the figures envisioned for 6G. To narrow this gap, we propose a novel channel charting technique that capitalizes on time-of-arrival measurements from surrounding Transmission Reception Points (TRPs) together with their locations, and that brings sensor fusion into channel charting by incorporating laser scanner data during the training phase of our algorithm. The proposed algorithm remains self-supervised during both training and testing, requiring no geometric models or ground-truth user positions. Simulation results show that our algorithm achieves sub-meter localization accuracy 90% of the time, outperforming state-of-the-art channel charting techniques and traditional triangulation-based approaches.
https://arxiv.org/abs/2405.04357
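The abstract does not spell out the training objective, but a common way to realize channel charting is to embed channel features into low-dimensional chart coordinates with a distance-preserving loss. The sketch below shows that generic idea on time-of-arrival feature vectors using a Sammon-style loss; the network size and the number of TRPs (4) are illustrative, not the paper's.

```python
# Generic channel-charting sketch (not the paper's exact algorithm): embed
# time-of-arrival feature vectors into 2-D chart coordinates so that pairwise
# chart distances match pairwise feature dissimilarities.
import torch
import torch.nn as nn

chart_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def sammon_loss(toa_feats: torch.Tensor) -> torch.Tensor:
    # toa_feats: (N, num_TRPs) ToA measurements from surrounding TRPs
    z = chart_net(toa_feats)                    # (N, 2) chart positions
    d_feat = torch.cdist(toa_feats, toa_feats)  # feature-space distances
    d_chart = torch.cdist(z, z)                 # chart-space distances
    mask = d_feat > 0                           # skip self-pairs
    return (((d_feat - d_chart) ** 2)[mask] / d_feat[mask]).mean()
```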
Self-Supervised Learning (SSL) has proven useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and compute. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown strong performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods such as wav2vec 2.0. Despite BEST-RQ's strong performance, the original paper omits details such as the number of GPU/TPU hours used for pre-training, and there is no official, easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on downstream tasks other than ASR and speech translation. In this work, we describe a re-implementation of the random-projection quantizer and present a preliminary study comparing it to wav2vec 2.0 on four downstream tasks. We discuss the details of and differences in our implementation, and show that a random-projection quantizer can achieve downstream performance similar to wav2vec 2.0 while cutting training time by more than half.
https://arxiv.org/abs/2405.04296
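The random-projection quantizer itself is simple enough to sketch: a frozen random matrix projects stacked input features, and the index of the nearest entry in a frozen random codebook becomes the target for masked prediction. The dimensions below are illustrative, not necessarily those of the paper's re-implementation.

```python
# Sketch of BEST-RQ's random-projection quantizer: frozen random projection,
# then nearest-neighbour lookup in a frozen random codebook. On unit vectors,
# the nearest neighbour in L2 is the entry with maximal cosine similarity.
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    def __init__(self, in_dim=320, code_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(in_dim, code_dim, generator=g)          # frozen
        self.codebook = F.normalize(
            torch.randn(codebook_size, code_dim, generator=g), dim=-1)  # frozen

    def __call__(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, in_dim) stacked input frames
        z = F.normalize(feats @ self.proj, dim=-1)   # random projection
        return (z @ self.codebook.T).argmax(dim=-1)  # (B, T) target indices
```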
Graph self-supervised learning has sparked a surge of research on training informative representations without access to any labeled data. However, our understanding of graph self-supervised learning remains limited, and the inherent relationships between the various self-supervised tasks are still unexplored. This paper aims to provide a fresh understanding of graph self-supervised learning based on task correlations. Specifically, we evaluate the performance of representations trained by one specific task on other tasks and define correlation values to quantify task correlations. Through this process, we unveil the correlations between various self-supervised tasks and can measure their expressive capabilities, which are closely related to downstream performance. By analyzing the correlation values between tasks across various datasets, we reveal the complexity of task correlations and the limitations of existing multi-task learning methods. To obtain more capable representations, we propose Graph Task Correlation Modeling (GraphTCM) to capture these task correlations and use it to enhance graph self-supervised training. Experimental results indicate that our method significantly outperforms existing methods across various downstream tasks.
https://arxiv.org/abs/2405.04245
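A hedged sketch of the correlation-value idea: representations trained with one self-supervised task are probed on every other task, filling a matrix of cross-task scores. `representations` and `evaluate_probe` are hypothetical placeholders, not the paper's API.

```python
# Fill a task-correlation matrix: entry (i, j) is the performance of
# representations pre-trained with SSL task i when probed on task j.
import numpy as np

def task_correlation_matrix(representations: dict, evaluate_probe) -> np.ndarray:
    tasks = list(representations)
    C = np.zeros((len(tasks), len(tasks)))
    for i, src in enumerate(tasks):
        for j, tgt in enumerate(tasks):
            # hypothetical probe: score task-src embeddings on task tgt
            C[i, j] = evaluate_probe(representations[src], target_task=tgt)
    return C
```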
Bimanual manipulation is a longstanding challenge in robotics due to the large number of degrees of freedom and the strict spatial and temporal synchronization required to generate meaningful behavior. Humans learn bimanual manipulation skills by watching other humans and by refining their abilities through play. In this work, we aim to enable robots to learn bimanual manipulation behaviors from human video demonstrations and fine-tune them through interaction. Inspired by seminal work in psychology and biomechanics, we propose modeling the interaction between two hands as a serial kinematic linkage -- specifically, as a screw motion -- which we use to define a new action space for bimanual manipulation: screw actions. We introduce ScrewMimic, a framework that leverages this novel action representation to facilitate learning from human demonstrations and self-supervised policy fine-tuning. Our experiments demonstrate that ScrewMimic can learn several complex bimanual behaviors from a single human video demonstration and that it outperforms baselines that interpret demonstrations and fine-tune directly in the original motion space of the two arms. For more information and video results, see this https URL.
https://arxiv.org/abs/2405.03666
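For readers unfamiliar with screw motions: by Chasles' theorem, any rigid displacement can be written as a rotation about an axis combined with a translation along that axis. The sketch below builds such a transform from screw parameters; it illustrates the underlying geometry, not ScrewMimic's actual action encoding.

```python
# Screw displacement: rotate by theta about the axis through point q with unit
# direction s, and translate by d along that axis (Chasles' theorem).
import numpy as np

def screw_transform(q, s, theta, d):
    s = s / np.linalg.norm(s)
    K = np.array([[0, -s[2], s[1]], [s[2], 0, -s[0]], [-s[1], s[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues
    t = (np.eye(3) - R) @ q + d * s   # translation induced by the screw
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T  # 4x4 homogeneous transform, e.g. of one hand relative to the other
```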
We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with a small amount of human input. The key idea is to tailor a module to each dataset that intelligently decides when the object tracker is failing, so that a human can be brought in to re-localize the object for continued tracking. Our approach leverages self-supervised learning on unlabeled videos to learn a tailored representation of the target object, which is then used to actively monitor the tracked region and decide when the tracker fails. Since no labeled data is needed, our approach can be applied to novel object categories. Experiments on three datasets demonstrate that our method outperforms existing approaches, especially for small, fast-moving, or occluded objects.
https://arxiv.org/abs/2405.03643
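A minimal sketch of the failure-monitoring idea, under the assumption that the tailored representation is an embedding network and failure is declared when the tracked region drifts away from the target's reference embedding; `encoder` and the threshold are hypothetical.

```python
# Flag tracker failure when the tracked crop's self-supervised embedding no
# longer resembles the target's reference embedding.
import torch
import torch.nn.functional as F

def tracker_failed(encoder, template_crop, current_crop, thresh=0.5) -> bool:
    with torch.no_grad():
        z_ref = F.normalize(encoder(template_crop), dim=-1)
        z_cur = F.normalize(encoder(current_crop), dim=-1)
    similarity = (z_ref * z_cur).sum(dim=-1).item()
    return similarity < thresh  # below threshold -> ask a human to re-localize
```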
Deep neural networks have achieved remarkable results in medical image processing tasks, particularly in classifying and detecting various diseases. However, when confronted with limited data, these networks face a critical vulnerability, often succumbing to overfitting by excessively memorizing the limited information available. This work addresses that challenge by improving the supervised contrastive learning method to reduce the impact of false positives. Unlike most existing methods, which rely predominantly on fully supervised learning, our approach leverages the advantages of self-supervised learning in conjunction with the available labeled data. We evaluate our method on the BreakHis dataset of breast cancer histopathology images and demonstrate an increase in classification accuracy of 1.45% at the image level and 1.42% at the patient level compared to the state-of-the-art method. This improvement corresponds to 93.63% absolute accuracy, highlighting our approach's effectiveness in leveraging data properties to learn a more appropriate representation space.
https://arxiv.org/abs/2405.03642
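For reference, the supervised contrastive (SupCon) loss that the paper builds on can be sketched as follows: embeddings sharing a label are pulled together and all others pushed apart. This is the standard formulation, not the paper's modified variant.

```python
# Standard SupCon loss over a batch of embeddings z with integer labels.
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.07):
    z = F.normalize(z, dim=-1)                        # (N, D) embeddings
    sim = z @ z.T / tau                               # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # mean log-probability over each anchor's positives (no-positive anchors skipped)
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()
```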
This study accelerates MR cholangiopancreatography (MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and 0.55T. Thirty healthy volunteers underwent conventional two-fold accelerated MRCP scans at field strengths of 3T or 0.55T. We trained a variational network (VN) on retrospectively six-fold undersampled data acquired at 3T, then evaluated our method against standard techniques such as parallel imaging (PI) and compressed sensing (CS), using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. Furthermore, since acquiring fully sampled MRCP is impractical, we added a self-supervised DL reconstruction (SSDU) to the comparison. We also tested our method in a prospectively accelerated scenario to reflect real-world clinical applications and evaluated its adaptability to MRCP at 0.55T. Our method reduced the average acquisition time from 599/542 to 255/180 seconds for MRCP at 3T/0.55T. In both retrospective and prospective undersampling scenarios, the PSNR and SSIM of the VN were higher than those of PI, CS, and SSDU, while the VN preserved the image quality of the undersampled data, i.e., sharpness and the visibility of the hepatobiliary ducts. The VN also produced high-quality reconstructions at 0.55T, yielding the highest PSNR and SSIM. In summary, a VN trained for highly accelerated MRCP reduces the acquisition time by a factor of 2.4/3.0 at 3T/0.55T while maintaining the image quality of the conventional acquisition.
https://arxiv.org/abs/2405.03732
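The two metrics used above are standard and can be computed with scikit-image, as in this small helper; `data_range` should match the dynamic range of the reference images.

```python
# PSNR and SSIM between a reference reconstruction and a candidate.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(reference: np.ndarray, reconstruction: np.ndarray):
    rng = reference.max() - reference.min()   # dynamic range of the reference
    psnr = peak_signal_noise_ratio(reference, reconstruction, data_range=rng)
    ssim = structural_similarity(reference, reconstruction, data_range=rng)
    return psnr, ssim
```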
This paper introduces AniTalker, an innovative framework for generating lifelike talking faces from a single portrait. Unlike existing models, which primarily focus on verbal cues such as lip synchronization and fail to capture the complex dynamics of facial expressions and nonverbal cues, AniTalker employs a universal motion representation that effectively captures a wide range of facial dynamics, including subtle expressions and head movements. AniTalker enhances motion depiction through two self-supervised learning strategies: the first reconstructs target video frames from source frames of the same identity to learn subtle motion representations, and the second develops an identity encoder using metric learning while actively minimizing the mutual information between the identity and motion encoders. This approach ensures that the motion representation is dynamic and devoid of identity-specific details, significantly reducing the need for labeled data. Additionally, integrating a diffusion model with a variance adapter allows the generation of diverse and controllable facial animations. This method not only demonstrates AniTalker's capability to create detailed and realistic facial movements but also underscores its potential for crafting dynamic avatars in real-world applications. Synthetic results can be viewed at this https URL.
https://arxiv.org/abs/2405.03121
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples. This ability stems from their capacity to identify common features shared between new and previously seen images while disregarding distractions such as background variations. For artificial neural network models, however, determining the most relevant features for distinguishing between two images from limited samples presents a challenge. In this paper, we propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches and encodes them using a pre-trained Vision Transformer (ViT) architecture. Specifically, we swap the class (CLS) token and patch tokens between the support and query sets to obtain mutual attention, which enables each set to focus on the most useful information. This strengthens intra-class representations and promotes closer proximity between instances of the same class. For implementation, we adopt a ViT-based network architecture and utilize pre-trained model parameters obtained through self-supervision. By leveraging Masked Image Modeling as the self-supervised pre-training task, the pre-trained model yields semantically meaningful representations while successfully avoiding supervision collapse. We then employ a meta-learning method to fine-tune the last several layers and the CLS token modules. Our strategy significantly reduces the number of parameters that require fine-tuning while effectively utilizing the capability of the pre-trained model. Extensive experiments show that our framework is simple, effective, and computationally efficient, achieving superior performance compared to state-of-the-art baselines on five popular few-shot classification benchmarks under the 5-shot and 1-shot scenarios.
https://arxiv.org/abs/2405.03109
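The token swap at the heart of the mutual attention can be sketched in a few lines: each set's patch tokens are processed together with the other set's CLS token. `vit_blocks` is a placeholder for the pre-trained transformer layers.

```python
# Swap CLS tokens between support and query token sequences so each set
# attends with the other's class summary.
import torch

def mutual_attention(support_tokens, query_tokens, vit_blocks):
    # *_tokens: (B, 1 + num_patches, D) with the CLS token at index 0
    s_cls, q_cls = support_tokens[:, :1], query_tokens[:, :1]
    support_swapped = torch.cat([q_cls, support_tokens[:, 1:]], dim=1)
    query_swapped = torch.cat([s_cls, query_tokens[:, 1:]], dim=1)
    return vit_blocks(support_swapped), vit_blocks(query_swapped)
```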
Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting growing attention from the research community. Compared with other action recognition tasks, violence detection in surveillance videos presents additional challenges, such as a significant variety of real fight scenes. Unfortunately, the available datasets are very small compared with other action recognition datasets. Moreover, in surveillance applications, the people in the scenes differ for each video and the background of the footage differs for each camera. Violent actions in real-life surveillance videos must also be detected quickly to prevent unwanted consequences, so models would clearly benefit from reduced memory usage and computational costs. These problems make classical action recognition methods difficult to adopt. To tackle these issues, we introduce JOSENet, a novel self-supervised framework that delivers outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and adopts a new regularized self-supervised learning approach for videos. JOSENet outperforms self-supervised state-of-the-art methods while requiring one-fourth the number of frames per video segment and a reduced frame rate. The source code and instructions to reproduce our experiments are available at this https URL.
https://arxiv.org/abs/2405.02961
Artificial neural networks trained on large, expert-labelled datasets are considered state-of-the-art for a range of medical image recognition tasks. However, categorically labelled datasets are time-consuming to generate and constrain classification to a pre-defined, fixed set of classes. For neuroradiological applications in particular, this represents a barrier to clinical adoption. To address these challenges, we present a self-supervised text-vision framework that learns to detect clinically relevant abnormalities in brain MRI scans by directly leveraging the rich information contained in accompanying free-text neuroradiology reports. Our training approach consisted of two steps. First, a dedicated neuroradiological language model -- NeuroBERT -- was trained to generate fixed-dimensional vector representations of neuroradiology reports (N = 50,523) via domain-specific self-supervised learning tasks. Next, convolutional neural networks (one per MRI sequence) learnt to map individual brain scans to their corresponding text vector representations by optimizing a mean squared error loss. Once trained, our text-vision framework can be used to detect abnormalities in unreported brain MRI examinations by scoring scans against suitable query sentences (e.g., 'there is an acute stroke' or 'there is hydrocephalus'), enabling a range of classification-based applications including automated triage. Potentially, our framework could also serve as a clinical decision support tool, not only by suggesting findings to radiologists and detecting errors in provisional reports, but also by retrieving and displaying examples of pathologies from historical examinations that could be relevant to the current case based on textual descriptors.
https://arxiv.org/abs/2405.02782
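A hedged sketch of the query-sentence scoring described above, assuming the trained CNN maps a scan into the report-embedding space and cosine similarity is used for scoring; `cnn` and `neurobert_encode` stand in for the trained models.

```python
# Score a scan against a free-text query by comparing the scan's embedding
# (CNN output in report space) with the query's NeuroBERT embedding.
import torch
import torch.nn.functional as F

def abnormality_score(cnn, neurobert_encode, scan,
                      query="there is hydrocephalus"):
    with torch.no_grad():
        v_img = F.normalize(cnn(scan), dim=-1)           # scan -> report space
        v_txt = F.normalize(neurobert_encode(query), dim=-1)
    return (v_img * v_txt).sum(-1)  # high score -> finding likely present
```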
Molecular property prediction is a key component of AI-driven drug discovery and molecular characterization learning. Despite recent advances, existing methods still face challenges such as a limited ability to generalize and inadequate representation learning from unlabeled data, especially for tasks specific to molecular structures. To address these limitations, we introduce DIG-Mol, a novel self-supervised graph neural network framework for molecular property prediction. The architecture leverages contrastive learning with dual interaction mechanisms and unique molecular graph enhancement strategies. DIG-Mol integrates a momentum distillation network with two interconnected networks to efficiently improve molecular characterization. The framework extracts key information about molecular structure and higher-order semantics by minimizing a contrastive loss. We establish DIG-Mol's state-of-the-art performance through extensive experimental evaluation on a variety of molecular property prediction tasks. In addition to demonstrating superior transferability in few-shot learning scenarios, our visualizations highlight DIG-Mol's enhanced interpretability and representation capabilities. These findings confirm the effectiveness of our approach in overcoming the challenges faced by traditional methods and mark a significant advance in molecular property prediction.
https://arxiv.org/abs/2405.02628
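The momentum-distillation component is a standard mechanism that can be sketched independently of DIG-Mol's specifics: a target network tracks an exponential moving average (EMA) of the online network's weights, providing stable targets for the contrastive objective. The encoder below is a stand-in, not the paper's GNN.

```python
# EMA update for a momentum (target) network tracking an online network.
import copy
import torch
import torch.nn as nn

online_net = nn.Linear(128, 64)           # stands in for the graph encoder
momentum_net = copy.deepcopy(online_net)  # target starts as an exact copy

@torch.no_grad()
def momentum_update(m: float = 0.99):
    # momentum weights <- m * momentum weights + (1 - m) * online weights
    for p_o, p_m in zip(online_net.parameters(), momentum_net.parameters()):
        p_m.mul_(m).add_(p_o, alpha=1.0 - m)
```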
Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to the lack of object-level information. We therefore propose UnSAMFlow, an unsupervised flow network that also leverages object information from the latest foundation model, the Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also analyze the poor gradient landscapes of traditional smoothness losses and instead propose a new smoothness definition based on homography. A simple yet effective mask feature module is further added to aggregate features at the object level. With all these adaptations, our method produces clear optical flow estimates with sharp boundaries around objects, outperforming state-of-the-art methods on both the KITTI and Sintel datasets. Our method also generalizes well across domains and runs very efficiently.
https://arxiv.org/abs/2405.02608
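One hedged way to realize a homography-based smoothness term (the concept, not necessarily the paper's exact loss): within each SAM mask, fit a homography to the flow via least squares and penalize the residual between the flow and the homography-induced motion.

```python
# Fit a homography to flow inside one object mask (DLT least squares) and
# measure how far the flow deviates from that homography's motion.
import torch

def homography_residual(coords: torch.Tensor, flow: torch.Tensor):
    # coords: (N, 2) pixel positions inside the mask; flow: (N, 2) vectors
    src, dst = coords, coords + flow
    ones, zeros3 = torch.ones(len(src), 1), torch.zeros(len(src), 3)
    x, y = src[:, :1], src[:, 1:]
    u, v = dst[:, :1], dst[:, 1:]
    rows_u = torch.cat([x, y, ones, zeros3, -u * x, -u * y, -u], 1)
    rows_v = torch.cat([zeros3, x, y, ones, -v * x, -v * y, -v], 1)
    A = torch.cat([rows_u, rows_v], 0)        # (2N, 9) DLT system
    _, _, Vh = torch.linalg.svd(A)
    H = Vh[-1].reshape(3, 3)                  # least-squares homography
    warped = torch.cat([src, ones], 1) @ H.T  # apply H in homogeneous coords
    denom = warped[:, 2:]
    denom = torch.where(denom.abs() < 1e-8, torch.full_like(denom, 1e-8), denom)
    return ((warped[:, :2] / denom) - dst).abs().mean()
```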
Computed tomography (CT) is a widely used non-invasive medical imaging technique for disease diagnosis. Diagnostic accuracy is often affected by image resolution, which can be insufficient in practice. In medical CT images, the through-plane resolution is often worse than the in-plane resolution, and slices may overlap, complicating diagnosis. Self-supervised methods for through-plane resolution enhancement, which train on in-plane images and infer on through-plane images, have shown promise for both CT and MRI. However, existing self-supervised methods either neglect overlap or can only handle specific cases with fixed combinations of resolution and overlap. To address these limitations, we propose a self-supervised method called SR4ZCT. It employs the same off-axis training approach while handling arbitrary combinations of resolution and overlap. Our method explicitly models the relationship between the resolutions and voxel spacings of different planes to accurately simulate training images that match the original through-plane images. We highlight the importance of accurate modeling in self-supervised off-axis training and demonstrate the effectiveness of SR4ZCT on a real-world dataset.
https://arxiv.org/abs/2405.02515
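A hedged sketch of the off-axis simulation idea: the high-resolution in-plane axis is degraded with a through-plane-like slice profile, where a slice thickness larger than the slice spacing reproduces overlap. The box-averaging profile below is a simplification of the paper's explicit resolution/spacing model.

```python
# Simulate thick, possibly overlapping slices along one axis of an in-plane
# image: average `thickness` rows every `spacing` rows (overlap if
# thickness > spacing), yielding training inputs that mimic through-plane data.
import numpy as np

def simulate_thick_slices(img: np.ndarray, thickness: int, spacing: int):
    # img: (H, W) high-resolution in-plane slice
    rows = []
    for start in range(0, img.shape[0] - thickness + 1, spacing):
        rows.append(img[start:start + thickness].mean(axis=0))
    return np.stack(rows)  # (num_simulated_slices, W)
```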
In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphones: (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. In particular, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. First, given the popularity of multi-camera setups in modern smartphones, the more-zoomed (telephoto) image can naturally be leveraged as a reference to guide the super-resolution (SR) of the less-zoomed (ultra-wide) image, giving us the opportunity to learn a deep network that performs SR from dual zoomed observations (DZSR). Second, for self-supervised learning of DZSR, we take the telephoto image, rather than an additional high-resolution image, as the supervision, and select a center patch from it as the reference to super-resolve the corresponding ultra-wide image patch. To mitigate the misalignment between the ultra-wide low-resolution (LR) patch and the telephoto ground-truth (GT) image during training, we first adopt patch-based optical flow alignment and then design an auxiliary LR to guide the deformation of the warped LR features. To generate visually pleasing results, we present a local overlapped sliced Wasserstein loss to better represent the perceptual difference between GT and output in feature space. At test time, DZSR can be directly deployed to super-resolve the whole ultra-wide image with the telephoto image as reference. In addition, we further exploit multiple zoomed observations for self-supervised RefSR and present a progressive fusion scheme for the effective utilization of reference images. Experiments show that our methods achieve better quantitative and qualitative performance than the state of the art. Code is available at this https URL.
https://arxiv.org/abs/2405.02171
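A minimal sketch of how a dual-zoom training pair can be formed, assuming a 2x zoom ratio for illustration: the center crop of the ultra-wide frame is the low-resolution input, and the telephoto frame covering that same region serves as supervision. Real pairs additionally require the alignment steps described above.

```python
# Form a (LR input, HR supervision) pair from simultaneously captured
# ultra-wide and telephoto frames, assuming the telephoto covers the center
# of the ultra-wide field of view at `zoom` times the resolution.
import torch

def make_training_pair(ultra_wide: torch.Tensor, telephoto: torch.Tensor,
                       zoom: int = 2):
    # ultra_wide, telephoto: (C, H, W) frames captured at the same moment
    C, H, W = ultra_wide.shape
    h, w = H // zoom, W // zoom
    top, left = (H - h) // 2, (W - w) // 2
    lr_patch = ultra_wide[:, top:top + h, left:left + w]  # low-res input
    gt_patch = telephoto                                  # hi-res supervision
    return lr_patch, gt_patch
```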
In this paper, we present a novel approach to text-independent phone-to-audio alignment based on phoneme recognition, representation learning, and knowledge transfer. Our method combines a self-supervised model (wav2vec2) fine-tuned for phoneme recognition with a Connectionist Temporal Classification (CTC) loss, a dimensionality reduction model, and a frame-level phoneme classifier trained on forced-alignment labels (from the Montreal Forced Aligner) to produce multilingual phonetic representations, thus requiring minimal additional training. We evaluate our model on synthetic native data from the TIMIT and SCRIBE datasets for American and British English, respectively. Our model outperforms the state of the art (charsiu) on statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work, but the design of the system makes it easily adaptable to other languages.
https://arxiv.org/abs/2405.02124
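The first stage is standard enough to sketch with the HuggingFace `transformers` API: wav2vec2 fine-tuned for phoneme recognition under a CTC loss. The checkpoint name, vocabulary size, and labels below are illustrative, not the paper's exact configuration.

```python
# Fine-tune wav2vec2 for phoneme recognition with a CTC loss.
import torch
from transformers import Wav2Vec2ForCTC

# ~40 phoneme labels is illustrative; the CTC head is freshly initialized.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base", vocab_size=40)

waveform = torch.randn(1, 16000)           # 1 s of 16 kHz audio (dummy)
phoneme_ids = torch.tensor([[3, 17, 25]])  # target phoneme sequence (dummy)
out = model(input_values=waveform, labels=phoneme_ids)
out.loss.backward()                        # CTC loss over phoneme labels
```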
This paper presents a novel self-supervised two-frame multi-camera metric depth estimation network, termed M$^2$Depth, designed to predict reliable scale-aware surrounding depth for autonomous driving. Unlike previous works that use multi-view images from a single time step or multiple time-step images from a single camera, M$^2$Depth takes temporally adjacent two-frame images from multiple cameras as input and produces high-quality surrounding depth. We first construct cost volumes in the spatial and temporal domains individually and propose a spatial-temporal fusion module that integrates the spatial-temporal information into a strong volume representation. We additionally combine the neural prior from SAM features with internal features to reduce the ambiguity between foreground and background and to strengthen depth edges. Extensive experimental results on the nuScenes and DDAD benchmarks show that M$^2$Depth achieves state-of-the-art performance. More results can be found at this https URL.
https://arxiv.org/abs/2405.02004
Expressive voice conversion (VC) performs speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, making speech quality heavily dependent on vocoder performance. A major challenge of expressive VC lies in emotional prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems, to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Code and samples are publicly available.
https://arxiv.org/abs/2405.01730
Self-supervised learning (SSL) has emerged as a key technique for training networks that generalize well to diverse tasks without task-specific supervision. This property makes SSL desirable for computational pathology, the study of digitized images of tissue, where there are many target applications and often limited labeled training samples. However, SSL algorithms and models have primarily been developed for natural images, and whether their performance can be improved by adaptation to particular domains remains an open question. In this work, we present an investigation of modifications to SSL for pathology data, focusing specifically on the DINOv2 algorithm. We propose alternative augmentations, regularization functions, and position encodings motivated by the characteristics of pathology images, and we evaluate the impact of these changes on several benchmarks to demonstrate the value of tailored approaches.
https://arxiv.org/abs/2405.01688
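As one hedged example of what such tailoring can look like (not necessarily the paper's exact choices): histopathology tiles have no canonical orientation, so vertical flips are label-preserving, and color jitter can mimic stain variability across labs.

```python
# Pathology-motivated augmentation pipeline with torchvision; parameters
# are illustrative, not the paper's configuration.
import torchvision.transforms as T

pathology_augment = T.Compose([
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),   # safe for tissue: there is no canonical "up"
    T.RandomApply([T.ColorJitter(0.3, 0.3, 0.3, 0.1)], p=0.8),  # stain-like
    T.ToTensor(),
])
```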