Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches can regress parametric 3D face models in real time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods, however, suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (e.g., aging, face swapping, digital make-up). In this paper, we propose a method for high-precision 3D face capture that takes advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start by reconstructing a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substitute its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. The result is a trained encoder capable of efficiently regressing pose and expression parameters in real time from previously unseen images, which, combined with our personalized geometry model, yields more accurate and higher-fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model compared to state-of-the-art baselines, and demonstrate its ability to generalize to unseen poses, expressions and lighting.
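To make the second stage concrete, here is a minimal sketch of the transfer-learning loop described above, assuming toy stand-ins (ToyEncoder, ToyPersonalizedDecoder) for the pre-trained encoder and the frozen personalized avatar model; the real method uses a full image formation model and renderer rather than these placeholders.

```python
# Minimal sketch: fine-tune an encoder against a frozen personalized decoder with a
# photometric self-supervision objective. Module names and sizes are hypothetical.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the pre-trained monocular face encoder (regresses pose + expression)."""
    def __init__(self, n_params=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_params))
    def forward(self, img):
        return self.backbone(img)

class ToyPersonalizedDecoder(nn.Module):
    """Stand-in for the frozen personalized geometry/appearance model + renderer."""
    def __init__(self, n_params=32, res=64):
        super().__init__()
        self.render = nn.Sequential(nn.Linear(n_params, 3 * res * res), nn.Sigmoid())
        self.res = res
    def forward(self, params):
        return self.render(params).view(-1, 3, self.res, self.res)

encoder, decoder = ToyEncoder(), ToyPersonalizedDecoder()
for p in decoder.parameters():          # the personalized image-formation model stays frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
frames = torch.rand(8, 3, 64, 64)       # frames from the subject's video collection
for step in range(10):
    params = encoder(frames)            # regress pose/expression parameters
    rendered = decoder(params)          # re-render with the personalized avatar
    loss = (rendered - frames).abs().mean()   # photometric self-supervision
    opt.zero_grad(); loss.backward(); opt.step()
```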
https://arxiv.org/abs/2409.07984
Surgical scenes convey crucial information about the quality of surgery. Pixel-wise localization of tools and anatomical structures is the first task towards deeper surgical analysis of microscopic or endoscopic surgical views. This is typically done via fully-supervised methods, which are annotation-hungry and, in several cases, demand medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation-efficient framework for the semantic segmentation of surgical scenes. We employ image-based self-supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine-tuning step. Our unsupervised setup, reinforced with only 36 annotation labels, achieves localization performance comparable to fully-supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to a $\sim 2\%$ improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.
https://arxiv.org/abs/2409.07801
Learning meaningful and interpretable representations from high-dimensional volumetric magnetic resonance (MR) images is essential for advancing personalized medicine. While Vision Transformers (ViTs) have shown promise in handling image data, their application to 3D multi-contrast MR images faces challenges due to computational complexity and interpretability. To address this, we propose a novel state-space-model (SSM)-based masked autoencoder which scales ViT-like models to handle high-resolution data effectively while also enhancing the interpretability of learned representations. We propose a latent-to-spatial mapping technique that enables direct visualization of how latent features correspond to specific regions in the input volumes in the context of SSM. We validate our method on two key neuro-oncology tasks: identification of isocitrate dehydrogenase mutation status and 1p/19q co-deletion classification, achieving state-of-the-art accuracy. Our results highlight the potential of SSM-based self-supervised learning to transform radiomics analysis by combining efficiency and interpretability.
https://arxiv.org/abs/2409.07746
With the advent of billion-parameter foundation models, efficient fine-tuning has become increasingly important for the adaptation of models to downstream tasks. However, especially in computer vision, it can be hard to achieve good performance when access to quality labeled data is lacking. In this work, we propose a method adapting pretrained generalist models in a self-supervised manner by learning binary masks. These self-supervised masking networks (SMNs) are up to 79x more efficient to store and significantly improve performance on label-efficient downstream tasks. We validate the usefulness of learning binary masks as a fine-tuning method on 8 datasets and 3 model architectures, and we demonstrate the effectiveness of SMNs in 3 label-efficient settings.
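The exact mask parameterization of SMNs is not spelled out here; a common way to learn binary masks over frozen weights is a straight-through estimator on real-valued scores, sketched below with a hypothetical MaskedLinear wrapper.

```python
# Minimal sketch: learn a binary mask over frozen pretrained weights with a
# straight-through estimator. The actual SMN parameterization may differ.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)  # frozen
        self.bias = nn.Parameter(linear.bias.detach(), requires_grad=False)      # frozen
        self.scores = nn.Parameter(torch.zeros_like(self.weight))  # real-valued mask logits

    def forward(self, x):
        hard = (self.scores > 0).float()                # binary mask: 1 bit per weight to store
        # straight-through: forward uses the hard mask, gradients flow through the sigmoid
        mask = hard + self.scores.sigmoid() - self.scores.sigmoid().detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)

layer = MaskedLinear(nn.Linear(16, 4))
opt = torch.optim.Adam([layer.scores], lr=1e-2)   # only the mask is trained (and later stored)
x, target = torch.randn(32, 16), torch.randn(32, 4)
for _ in range(50):
    loss = ((layer(x) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```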
https://arxiv.org/abs/2409.07577
Coded Aperture Snapshot Spectral Imaging (CASSI) is a crucial technique for capturing three-dimensional multispectral images (MSIs) through the complex inverse task of reconstructing these images from coded two-dimensional measurements. Current state-of-the-art methods, predominantly end-to-end, face limitations in reconstructing high-frequency details and often rely on constrained datasets like KAIST and CAVE, resulting in models with poor generalizability. In response to these challenges, this paper introduces a novel one-step Diffusion Probabilistic Model within a self-supervised adaptation framework for Snapshot Compressive Imaging (SCI). Our approach leverages a pretrained SCI reconstruction network to generate initial predictions from two-dimensional measurements. Subsequently, a one-step diffusion model produces high-frequency residuals to enhance these initial predictions. Additionally, acknowledging the high costs associated with collecting MSIs, we develop a self-supervised paradigm based on the Equivariant Imaging (EI) framework. Experimental results validate the superiority of our model compared to previous methods, showcasing its simplicity and adaptability to various end-to-end or unfolding techniques.
https://arxiv.org/abs/2409.07417
Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables communication between modalities in a single multimodal space. Instead of imposing cross- or intra-modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM both in a controlled setting and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities; in the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on six multimodal benchmarks.
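A minimal sketch of the core idea, aligning two augmented views of a fused multimodal representation with an InfoNCE bound on mutual information; the fusion network, augmentations (feature dropout) and temperature below are illustrative assumptions rather than CoMM's exact design.

```python
# Minimal sketch: InfoNCE between two augmented versions of a fused multimodal feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE between two batches of paired features (positives on the diagonal)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

B, d_img, d_txt, d = 16, 128, 64, 96
fuse = nn.Sequential(nn.Linear(d_img + d_txt, d), nn.ReLU(), nn.Linear(d, d))  # shared multimodal space

img, txt = torch.randn(B, d_img), torch.randn(B, d_txt)   # per-modality features
# two "augmented versions" of the same multimodal input (here: simple feature dropout)
v1 = fuse(torch.cat([F.dropout(img, 0.2), F.dropout(txt, 0.2)], dim=-1))
v2 = fuse(torch.cat([F.dropout(img, 0.2), F.dropout(txt, 0.2)], dim=-1))
loss = info_nce(v1, v2)   # minimizing this maximizes a lower bound on MI between the two views
```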
https://arxiv.org/abs/2409.07402
Despite promising progress on the face swapping task, realistic swapped images remain elusive, often marred by artifacts, particularly in scenarios involving high pose variation, color differences, and occlusion. To address these issues, we propose a novel approach that better harnesses diffusion models for face swapping by making the following core contributions. (a) We propose to re-frame the face-swapping task as a self-supervised, train-time inpainting problem, enhancing identity transfer while blending with the target image. (b) We introduce multi-step Denoising Diffusion Implicit Model (DDIM) sampling during training, reinforcing identity and perceptual similarities. (c) We introduce CLIP feature disentanglement to extract pose, expression, and lighting information from the target image, improving fidelity. (d) Further, we introduce a mask shuffling technique during inpainting training, which allows us to create a so-called universal model for swapping, with the additional capability of head swapping. Our method can swap hair and even accessories, going beyond traditional face swapping. Unlike prior works that rely on multiple off-the-shelf models, ours is a relatively unified approach, making it resilient to errors in other off-the-shelf models. Extensive experiments on the FFHQ and CelebA datasets validate the efficacy and robustness of our approach, showcasing high-fidelity, realistic face swapping with minimal inference time. Our code is available at this https URL.
https://arxiv.org/abs/2409.07269
Diffusion Probabilistic Models have recently attracted significant attention in the computer vision community due to their outstanding performance. However, while a substantial amount of diffusion-based research has focused on generative tasks, no prior work introduces diffusion models to advance video polyp segmentation, which is frequently challenged by polyps' high camouflage and redundant temporal cues. In this paper, we present a novel diffusion-based network for the video polyp segmentation task, dubbed Diff-VPS. We incorporate multi-task supervision into diffusion models to promote their discrimination in pixel-by-pixel segmentation; this integrates the contextual high-level information obtained from the joint classification and detection tasks. To explore temporal dependency, a Temporal Reasoning Module (TRM) is devised that reasons about and reconstructs the target frame from previous frames. We further equip TRM with a generative adversarial self-supervised strategy to produce more realistic frames and thus capture better dynamic cues. Extensive experiments are conducted on SUN-SEG, and the results indicate that our proposed Diff-VPS achieves state-of-the-art performance. Code is available at this https URL.
https://arxiv.org/abs/2409.07238
Computed tomography (CT) reconstruction plays a crucial role in industrial nondestructive testing and medical diagnosis. Sparse view CT reconstruction aims to reconstruct high-quality CT images while only using a small number of projections, which helps to improve the detection speed of industrial assembly lines and is also meaningful for reducing radiation in medical scenarios. Sparse CT reconstruction methods based on implicit neural representations (INRs) have recently shown promising performance, but still produce artifacts because of the difficulty of obtaining useful prior information. In this work, we incorporate a powerful prior: the total number of material categories of objects. To utilize the prior, we design AC-IND, a self-supervised method based on Attenuation Coefficient Estimation and Implicit Neural Distribution. Specifically, our method first transforms the traditional INR from scalar mapping to probability distribution mapping. Then we design a compact attenuation coefficient estimator initialized with values from a rough reconstruction and fast segmentation. Finally, our algorithm finishes the CT reconstruction by jointly optimizing the estimator and the generated distribution. Through experiments, we find that our method not only outperforms the comparative methods in sparse CT reconstruction but also can automatically generate semantic segmentation maps.
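A minimal sketch of the distribution-valued INR described above: a coordinate network outputs a probability distribution over an assumed number of material categories, and the attenuation at each point is the expectation under learnable per-category coefficients. The rough-reconstruction initialization and the CT forward operator are omitted, and the sizes below are illustrative.

```python
# Minimal sketch: INR that maps a coordinate to a distribution over K material categories,
# combined with a compact learnable attenuation-coefficient estimator.
import torch
import torch.nn as nn

K = 3                                          # assumed number of material categories (the prior)
inr = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, K))          # logits over material categories per point
mu = nn.Parameter(torch.tensor([0.0, 0.2, 0.5]))   # attenuation coefficient per category
                                                    # (would be initialized from a rough reconstruction)

coords = torch.rand(1024, 2) * 2 - 1           # query points in [-1, 1]^2
probs = inr(coords).softmax(dim=-1)            # per-point material distribution
attenuation = probs @ mu                       # expected attenuation coefficient per point
# `attenuation` would then be projected by the CT forward model and compared to the sparse
# measurements, jointly optimizing the INR and `mu`.
```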
https://arxiv.org/abs/2409.07171
We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of ``zoomed-in'' high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.
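As an illustration of one of the non-self-supervised representations mentioned above, the sketch below builds a pitch histogram from frame-level F0 values; the bin layout and reference frequency are assumptions, and F0 extraction by a pitch tracker is presumed to happen upstream.

```python
# Minimal sketch: a pitch-histogram feature computed from frame-level F0 estimates.
import numpy as np

def pitch_histogram(f0_hz, n_bins=64, ref_hz=55.0):
    voiced = f0_hz[f0_hz > 0]                              # drop unvoiced frames
    semitones = 12.0 * np.log2(voiced / ref_hz)            # Hz -> semitones above the reference
    hist, _ = np.histogram(semitones, bins=n_bins, range=(0, n_bins), density=True)
    return hist                                            # fixed-length feature for a MOS predictor

f0 = np.abs(np.random.randn(300)) * 80 + 100               # stand-in for tracked F0 values (Hz)
feature = pitch_histogram(f0)
```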
https://arxiv.org/abs/2409.07001
Recent progress in semantic point cloud analysis is largely driven by synthetic data (e.g., ModelNet and ShapeNet), which are typically complete, well-aligned and noise-free. As a result, representations of those ideal synthetic point clouds exhibit limited geometric variation and can achieve good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain-invariant geometric patterns from incomplete and noisy point clouds. To address this problem, we introduce a novel scheme for inducing geometric invariance of point cloud representations across domains, by regularizing representation learning with two self-supervised geometric augmentation tasks. On one hand, a novel pretext task of predicting the translation distances of augmented samples is proposed to alleviate the centroid shift of point clouds caused by occlusion and noise. On the other hand, we pioneer a cascaded integration of relational self-supervised learning on geometrically augmented point clouds, utilizing the intrinsic relationship between augmented variants and other samples as extra constraints on cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance.
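A minimal sketch of the first pretext task, predicting the translation distance of an augmented point cloud; the toy encoder below stands in for the actual point-cloud backbone, and the offset range is an assumption.

```python
# Minimal sketch: translate a point cloud by a random offset and regress its magnitude.
import torch
import torch.nn as nn

class ToyPointEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, pts):                     # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values  # permutation-invariant pooling

encoder = ToyPointEncoder()
head = nn.Linear(64, 1)                         # predicts the translation distance
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

pts = torch.randn(8, 1024, 3)                   # a batch of (possibly partial/noisy) point clouds
offset = torch.randn(8, 1, 3) * 0.5             # random translation per cloud
target = offset.norm(dim=-1)                    # (8, 1) ground-truth translation distance
pred = head(encoder(pts + offset))
loss = ((pred - target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```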
https://arxiv.org/abs/2409.06956
3D perception in LiDAR point clouds is crucial for a self-driving vehicle to act properly in a 3D environment. However, manually labeling point clouds is hard and costly. There has been a growing interest in self-supervised pre-training of 3D perception models. Following the success of contrastive learning in images, current methods mostly conduct contrastive pre-training on point clouds only. Yet an autonomous driving vehicle is typically supplied with multiple sensors, including cameras and LiDAR. In this context, we systematically study single-modality, cross-modality, and multi-modality contrastive learning of point clouds, and show that cross-modality wins over the other alternatives. In addition, considering the huge difference between the training sources of 2D images and 3D point clouds, it remains unclear how to design more effective contrastive units for LiDAR. We therefore propose instance-aware and similarity-balanced contrastive units tailored for self-driving point clouds. Extensive experiments reveal that our approach achieves remarkable performance gains over various point cloud models across the downstream perception tasks of LiDAR-based 3D object detection and 3D semantic segmentation on four popular benchmarks: Waymo Open Dataset, nuScenes, SemanticKITTI and ONCE.
https://arxiv.org/abs/2409.06827
In this paper, we introduce DetailCLIP, a detail-oriented CLIP designed to address the limitations of contrastive learning-based vision-language models, particularly CLIP, in handling detail-oriented and fine-grained tasks like segmentation. While CLIP and its variants excel at the global alignment of image and text representations, they often struggle to capture the fine-grained details necessary for precise segmentation. To overcome these challenges, we propose a novel framework that employs patch-level self-distillation and pixel-level reconstruction losses, enhanced with an attention-based token removal mechanism. This approach selectively retains semantically relevant tokens, so the model focuses on the image regions most relevant to its core functions (textual information processing, patch comparison, and image reconstruction), ensuring that it learns both high-level semantics and detailed visual features. Our experiments demonstrate that DetailCLIP surpasses existing CLIP-based and traditional self-supervised learning (SSL) models in segmentation accuracy and exhibits superior generalization across diverse datasets. DetailCLIP represents a significant advancement in vision-language modeling, offering a robust solution for tasks that demand high-level semantic understanding and detailed feature extraction. this https URL.
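A minimal sketch of an attention-based token removal step of the kind described above: patch tokens are scored by, for example, the CLS attention and only the top-k are retained; the scoring rule and keep ratio are assumptions rather than DetailCLIP's exact mechanism.

```python
# Minimal sketch: keep the top-k patch tokens ranked by CLS attention.
import torch

def remove_tokens(tokens, cls_attn, keep_ratio=0.5):
    """tokens: (B, N, D) patch tokens; cls_attn: (B, N) attention from the CLS token."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                      # indices of the most salient tokens
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(B, k, D))

tokens = torch.randn(2, 196, 768)
cls_attn = torch.rand(2, 196)
kept = remove_tokens(tokens, cls_attn)   # (2, 98, 768) tokens passed on to comparison/reconstruction
```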
https://arxiv.org/abs/2409.06809
Pre-training video transformers generally requires a large amount of data, presenting significant challenges in terms of data collection costs and concerns related to privacy, licensing, and inherent biases. Synthesizing data is one of the promising ways to solve these issues, yet pre-training solely on synthetic data has its own challenges. In this paper, we introduce an effective self-supervised learning framework for videos that leverages readily available and less costly static images. Specifically, we define the Pseudo Motion Generator (PMG) module that recursively applies image transformations to generate pseudo-motion videos from images. These pseudo-motion videos are then leveraged in masked video modeling. Our approach is applicable to synthetic images as well, thus entirely freeing video pre-training from data collection costs and other concerns in real data. Through experiments in action recognition tasks, we demonstrate that this framework allows effective learning of spatio-temporal features through pseudo-motion videos, significantly improving over existing methods which also use static images and partially outperforming those using both real and synthetic videos. These results uncover fragments of what video transformers learn through masked video modeling.
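A minimal sketch of a pseudo-motion generator in the spirit of PMG: a small random affine transform is applied recursively to a static image so that apparent motion accumulates across frames; the transform family and magnitudes below are illustrative assumptions.

```python
# Minimal sketch: recursively apply random affine transforms to build a pseudo-motion video.
import torch
import torchvision.transforms.functional as TF

def pseudo_motion_video(image, num_frames=16, max_shift=4, max_angle=2.0):
    """image: (C, H, W) tensor in [0, 1]; returns (T, C, H, W)."""
    frames, frame = [image], image
    for _ in range(num_frames - 1):
        dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
        angle = (torch.rand(1).item() * 2 - 1) * max_angle
        # transform the *previous* frame, so motion accumulates recursively
        frame = TF.affine(frame, angle=angle, translate=[dx, dy], scale=1.0, shear=[0.0])
        frames.append(frame)
    return torch.stack(frames)

video = pseudo_motion_video(torch.rand(3, 112, 112))   # input clip for masked video modeling
```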
https://arxiv.org/abs/2409.06665
Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While artificial intelligence (AI) foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduced a pretraining strategy that combines self-supervised reconstructions, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Through evaluation using 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially showcasing few-shot, even zero-shot capabilities in real-world long-tail scenarios.
https://arxiv.org/abs/2409.06644
In this work, we apply state-of-the-art self-supervised learning techniques on a large dataset of seafloor imagery, \textit{BenthicNet}, and study their performance for a complex hierarchical multi-label (HML) classification downstream task. In particular, we demonstrate the capacity to conduct HML training in scenarios where there exist multiple levels of missing annotation information, an important scenario for handling heterogeneous real-world data collected by multiple research groups with differing data collection protocols. We find that, when using smaller one-hot image label datasets typical of local or regional scale benthic science projects, models pre-trained with self-supervision on a larger collection of in-domain benthic data outperform models pre-trained on ImageNet. In the HML setting, we find the model can attain a deeper and more precise classification if it is pre-trained with self-supervision on in-domain data. We hope this work can establish a benchmark for future models in the field of automated underwater image annotation tasks and can guide work in other domains with hierarchical annotations of mixed resolution.
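A minimal sketch of hierarchical multi-label training with missing annotation levels: one classification head per hierarchy level, with levels whose labels are absent for a sample masked out of the loss; the head sizes and ignore value are assumptions, not BenthicNet's actual label scheme.

```python
# Minimal sketch: per-level heads with loss masking for missing hierarchy annotations.
import torch
import torch.nn as nn

level_sizes = [4, 10, 25]                       # e.g. coarse -> fine label counts per hierarchy level
heads = nn.ModuleList([nn.Linear(128, n) for n in level_sizes])
IGNORE = -1                                     # marks a missing annotation at that level

features = torch.randn(8, 128)                  # embeddings from the (self-supervised) backbone
labels = torch.randint(-1, 4, (8, 3))           # per-sample, per-level labels; -1 = unknown

loss, n_terms = 0.0, 0
for lvl, head in enumerate(heads):
    mask = labels[:, lvl] != IGNORE
    if mask.any():                              # skip levels with no annotation in this batch
        logits = head(features[mask])
        loss = loss + nn.functional.cross_entropy(logits, labels[mask, lvl])
        n_terms += 1
loss = loss / max(n_terms, 1)
```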
https://arxiv.org/abs/2409.06618
Self-supervised learning (SSL) is an effective method for exploiting unlabelled data to learn a high-level embedding space that can be used for various downstream tasks. However, existing methods for monitoring the quality of the encoder, either during training of one model or to compare several trained models, still rely on access to annotated data. When SSL methodologies are applied to new data domains, a sufficiently large labelled dataset may not always be available. In this study, we propose several evaluation metrics that can be applied to the embeddings of unlabelled data and investigate their viability by comparing them to linear probe (LP) accuracy, a common metric which utilizes an annotated dataset. In particular, we apply $k$-means clustering and measure the clustering quality with the silhouette score and clustering agreement. We also measure the entropy of the embedding distribution. We find that while the clusters did correspond better to the ground-truth annotations as training of the network progressed, label-free clustering metrics correlated with linear probe accuracy only when training with the SSL methods SimCLR and MoCo-v2, but not with SimSiam. Additionally, although entropy did not always have strong correlations with LP accuracy, this appears to be due to instability arising early in training, with the metric stabilizing and becoming more reliable at later stages of learning. Furthermore, while entropy generally decreases as learning progresses, this trend reverses for SimSiam. More research is required to establish the cause of this unexpected behaviour. Lastly, we find that while clustering-based approaches are likely only viable for same-architecture comparisons, entropy may be architecture-independent.
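A minimal sketch of the label-free metrics discussed above: $k$-means clustering quality via the silhouette score, agreement between two clusterings, and an entropy estimate of the embedding distribution. The Gaussian differential-entropy estimator is an assumption, not necessarily the paper's exact choice.

```python
# Minimal sketch: label-free embedding-quality metrics on unlabelled data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

emb = np.random.randn(500, 64)                      # embeddings of unlabelled data

km_a = KMeans(n_clusters=10, n_init=10, random_state=0).fit(emb)
km_b = KMeans(n_clusters=10, n_init=10, random_state=1).fit(emb)

sil = silhouette_score(emb, km_a.labels_)                     # clustering quality, no labels needed
agreement = adjusted_rand_score(km_a.labels_, km_b.labels_)   # stability across clustering runs

# differential entropy under a Gaussian fit: 0.5 * log((2*pi*e)^d * det(cov))
cov = np.cov(emb, rowvar=False) + 1e-6 * np.eye(emb.shape[1])
entropy = 0.5 * (emb.shape[1] * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

print(f"silhouette={sil:.3f}  agreement={agreement:.3f}  entropy={entropy:.1f} nats")
```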
https://arxiv.org/abs/2409.06612
Singing voice conversion (SVC) is hindered by noise sensitivity due to the use of non-robust methods for extracting pitch and energy during inference. As clean signals are key for the source audio in SVC, music source separation preprocessing offers a viable solution for handling noisy audio, such as singing with background music (BGM). However, current separation methods struggle to fully remove noise or excessively suppress signal components, affecting the naturalness and similarity of the processed audio. To tackle this, our study introduces RobustSVC, a novel any-to-one SVC framework that converts noisy vocals into clean vocals sung by the target singer. We replace the non-robust feature with a HuBERT-based melody extractor and use adversarial training mechanisms with three discriminators to reduce information leakage in self-supervised representations. Experimental results show that RobustSVC is noise-robust and achieves higher similarity and naturalness than baseline methods in both noisy and clean vocal conditions.
https://arxiv.org/abs/2409.06237
Dynamic facial expression recognition (DFER) is essential for understanding human emotions and behavior. However, conventional DFER methods, which primarily use dynamic facial data, often underutilize static expression images and their labels, limiting their performance and robustness. To overcome this, we introduce UniLearn, a novel unified learning paradigm that integrates static facial expression recognition (SFER) data to enhance the DFER task. UniLearn employs a dual-modal self-supervised pre-training method, leveraging both facial expression images and videos to enhance a ViT model's spatiotemporal representation capability. The pre-trained model is then fine-tuned on both static and dynamic expression datasets using a joint fine-tuning strategy. To prevent negative transfer during joint fine-tuning, we introduce an innovative Mixture of Adapter Experts (MoAE) module that enables task-specific knowledge acquisition and effectively integrates information from both static and dynamic expression data. Extensive experiments demonstrate UniLearn's effectiveness in leveraging complementary information from static and dynamic facial data, leading to more accurate and robust DFER. UniLearn consistently achieves state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65\%, 58.44\%, and 76.68\%, respectively. The source code and model weights will be publicly available at \url{this https URL}.
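A minimal sketch of a mixture-of-adapter-experts block: several bottleneck adapters whose outputs are mixed by a learned per-token softmax gate and added residually to the ViT features; the sizes and gating below are illustrative assumptions rather than the exact MoAE design.

```python
# Minimal sketch: gated mixture of bottleneck adapters applied residually to token features.
import torch
import torch.nn as nn

class MoAE(nn.Module):
    def __init__(self, dim=768, bottleneck=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                            # x: (B, N, D) token features
        weights = self.gate(x).softmax(dim=-1)       # (B, N, E) per-token expert weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, D, E)
        mixed = (expert_out * weights.unsqueeze(2)).sum(dim=-1)
        return x + mixed                             # residual adapter output

tokens = torch.randn(2, 197, 768)
out = MoAE()(tokens)                                 # same shape, task-adapted features
```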
https://arxiv.org/abs/2409.06154
Representing speech with discrete units has been widely used in speech codec and speech generation. However, there are several unverified claims about self-supervised discrete units, such as disentangling phonetic and speaker information with k-means, or assuming information loss after k-means. In this work, we take an information-theoretic perspective to answer how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement. Our results offer a comprehensive assessment on the choice of discrete units, and suggest that a lot more information in the residual should be mined rather than discarded.
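A minimal sketch of k-means residual vector quantization as analyzed above: features are assigned to discrete units, and the residual (feature minus its centroid) is what would be probed for remaining speaker or phonetic information; the codebook size and feature dimension are illustrative.

```python
# Minimal sketch: two-stage residual vector quantization with k-means codebooks.
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(2000, 768)                   # stand-in for frame-level HuBERT features

km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats)
units = km.labels_                                   # discrete units (cluster indices)
residual = feats - km.cluster_centers_[units]        # what remains after quantization

# A second-stage quantizer on the residual continues the residual VQ cascade; probes on
# `units` vs. `residual` compare how much speaker/phonetic information each carries.
units2 = KMeans(n_clusters=100, n_init=10, random_state=1).fit_predict(residual)
```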
https://arxiv.org/abs/2409.06109