Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, namely semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, comprising over 2 million video frames, are publicly available at: this https URL.
https://arxiv.org/abs/2501.09436
Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) a cross-domain gap, i.e., significant variations between source- and target-domain pose distributions; and 2) a structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporally consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology structure of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D and 3D human pose estimation tasks.
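The abstract does not spell out the loss terms. As a rough illustration, one plausible reading of "temporally consistent contrastive learning with uniformity regularization" is an InfoNCE term over adjacent-frame embeddings plus the uniformity regularizer of Wang and Isola (2020); the sketch below assumes this reading, and all shapes and weights are illustrative:

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_t, z_tp1, temperature=0.1):
    """InfoNCE between embeddings of adjacent frames: z_t[i] and z_tp1[i]
    are positives; all other pairs in the batch are negatives."""
    z_t = F.normalize(z_t, dim=1)
    z_tp1 = F.normalize(z_tp1, dim=1)
    logits = z_t @ z_tp1.T / temperature              # (B, B) similarities
    labels = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, labels)

def uniformity_loss(z, t=2.0):
    """Uniformity regularizer (Wang & Isola, 2020): log of the mean Gaussian
    potential, minimized when embeddings spread over the unit hypersphere."""
    z = F.normalize(z, dim=1)
    sq_dists = torch.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

# Toy usage on WiFi-frame embeddings from adjacent time steps.
B, D = 32, 128
z_t, z_tp1 = torch.randn(B, D), torch.randn(B, D)
loss = temporal_contrastive_loss(z_t, z_tp1) + 0.5 * uniformity_loss(z_t)
```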
https://arxiv.org/abs/2501.09411
Neural implicit k-space representations (NIK) have shown promising results for dynamic magnetic resonance imaging (MRI) at high temporal resolutions. Yet, reducing acquisition time, and thereby the available training data, results in severe performance drops due to overfitting. To address this, we introduce a novel self-supervised k-space loss function $\mathcal{L}_\mathrm{PISCO}$, applicable for regularization of NIK-based reconstructions. The proposed loss function is based on the concept of parallel imaging-inspired self-consistency (PISCO), enforcing a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK representations. Particularly for high acceleration factors (R$\geq$54), NIK with PISCO achieves superior spatio-temporal reconstruction quality compared to state-of-the-art methods. Furthermore, an extensive analysis of the loss assumptions and stability shows PISCO's potential as a versatile self-supervised k-space loss function for further applications and architectures. Code is available at: this https URL
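The exact form of $\mathcal{L}_\mathrm{PISCO}$ is defined in the paper; conceptually, parallel-imaging-style self-consistency says that one global set of linear weights should map every k-space neighborhood to its center sample. The NumPy sketch below is an assumption-heavy toy version of that idea (patch size, subset sizes, and the least-squares fit are illustrative, not the paper's formulation):

```python
import numpy as np

def pisco_style_residual(kspace, patch=3, n_fit=512, n_eval=512, seed=0):
    """Toy self-consistency check on a 2D k-space array: fit one global set
    of linear weights predicting each sample from its neighborhood on one
    random subset of locations, then measure the residual on a disjoint
    subset. A consistent k-space should yield a small held-out residual."""
    rng = np.random.default_rng(seed)
    H, W = kspace.shape
    r = patch // 2
    ys = rng.integers(r, H - r, size=n_fit + n_eval)
    xs = rng.integers(r, W - r, size=n_fit + n_eval)

    # Build (neighborhood -> center) regression pairs; center pixel excluded.
    nbrs, centers = [], []
    for y, x in zip(ys, xs):
        win = kspace[y - r:y + r + 1, x - r:x + r + 1].flatten()
        nbrs.append(np.delete(win, (patch * patch) // 2))
        centers.append(kspace[y, x])
    A, b = np.array(nbrs), np.array(centers)

    # One global weight vector, fit on the first subset only.
    w, *_ = np.linalg.lstsq(A[:n_fit], b[:n_fit], rcond=None)
    # Residual on the held-out subset acts as the self-consistency loss.
    return np.mean(np.abs(A[n_fit:] @ w - b[n_fit:]) ** 2)

kspace = np.fft.fft2(np.random.rand(64, 64))   # stand-in "measurement"
print(pisco_style_residual(kspace))
```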
https://arxiv.org/abs/2501.09403
Few-shot class-incremental learning (FSCIL) requires a model to learn new classes from only a small number of training instances while retaining knowledge of previously learned classes. Existing frameworks typically freeze the parameters associated with previously learned classes while incorporating new ones. However, this approach often results in suboptimal separation of the previously learned classes, leading to overlap between old and new classes. Consequently, performance on old classes degrades as new classes are added. To address these challenges, we propose a novel feature-augmentation-driven contrastive learning framework designed to enhance the separation of previously learned classes so as to accommodate new ones. Our approach involves augmenting feature vectors and assigning proxy labels to these vectors. This strategy expands the feature space, ensuring seamless integration of new classes within the expanded space. Additionally, we employ a self-supervised contrastive loss to improve the separation between previous classes. We validate our framework through experiments on three FSCIL benchmark datasets: CIFAR100, miniImageNet, and CUB200. The results demonstrate that our Feature Augmentation driven Contrastive Learning framework significantly outperforms other approaches, achieving state-of-the-art performance.
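The abstract describes the mechanism only at a high level. One plausible instantiation, with every specific assumed, is to perturb features, assign the perturbed copies proxy labels beyond the real classes, and train with a supervised contrastive loss over the expanded label space:

```python
import torch
import torch.nn.functional as F

def augment_with_proxy_labels(feats, labels, n_classes, noise_scale=0.5):
    """Create augmented copies of the feature vectors and assign them proxy
    labels shifted past the real classes, expanding the label space."""
    aug = feats + noise_scale * torch.randn_like(feats)
    proxy = labels + n_classes                 # proxy classes live in [C, 2C)
    return torch.cat([feats, aug]), torch.cat([labels, proxy])

def supcon_loss(feats, labels, temperature=0.1):
    """Supervised contrastive loss: samples sharing a label attract."""
    z = F.normalize(feats, dim=1)
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = (z @ z.T / temperature).masked_fill(eye, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)  # drop self-pairs cleanly
    pos = ((labels[:, None] == labels[None, :]) & ~eye).float()
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

feats, labels = torch.randn(64, 128), torch.randint(0, 10, (64,))
f_aug, y_aug = augment_with_proxy_labels(feats, labels, n_classes=10)
loss = supcon_loss(f_aug, y_aug)
```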
https://arxiv.org/abs/2501.09361
Mapping land surface disturbances supports disaster response, resource and ecosystem management, and climate adaptation efforts. Synthetic aperture radar (SAR) is an invaluable tool for disturbance mapping, providing consistent time-series images of the ground regardless of weather or illumination conditions. Despite SAR's potential for disturbance mapping, processing SAR data to an analysis-ready format requires expertise and significant compute resources, particularly for large-scale global analysis. In October 2023, NASA's Observational Products for End-Users from Remote Sensing Analysis (OPERA) project released the near-global Radiometric Terrain Corrected SAR backscatter from Sentinel-1 (RTC-S1) dataset, providing publicly available, analysis-ready SAR imagery. In this work, we utilize this new dataset to systematically analyze land surface disturbances. As labeling SAR data is often prohibitively time-consuming, we train a self-supervised vision transformer - which requires no labels to train - on OPERA RTC-S1 data to estimate a per-pixel distribution from the set of baseline imagery and assess disturbances when there is significant deviation from the modeled distribution. To test our model's capability and generality, we evaluate three different natural disasters - which represent high-intensity, abrupt disturbances - from three different regions of the world. Across events, our approach yields high quality delineations: F1 scores exceeding 0.6 and Areas Under the Precision-Recall Curve exceeding 0.65, consistently outperforming existing SAR disturbance methods. Our findings suggest that a self-supervised vision transformer is well-suited for global disturbance mapping and can be a valuable tool for operational, near-global disturbance monitoring, particularly when labeled data does not exist.
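The deviation test itself is simple to illustrate. In the sketch below the per-pixel baseline distribution is a plain Gaussian fit rather than the paper's transformer-estimated distribution, and the z-score threshold is an assumption:

```python
import numpy as np

def disturbance_map(baseline_stack, new_image, z_thresh=3.0):
    """Flag pixels whose new backscatter deviates significantly from the
    per-pixel distribution of a baseline time series. Here the distribution
    is a simple per-pixel Gaussian; in the paper it is estimated by a
    self-supervised vision transformer."""
    mu = baseline_stack.mean(axis=0)
    sigma = baseline_stack.std(axis=0) + 1e-6   # avoid divide-by-zero
    z = np.abs(new_image - mu) / sigma          # per-pixel deviation score
    return z > z_thresh                         # boolean disturbance mask

baseline = np.random.randn(12, 256, 256) * 0.1  # 12 baseline acquisitions
new = baseline.mean(axis=0).copy()
new[100:120, 100:120] += 1.0                    # synthetic disturbance
mask = disturbance_map(baseline, new)
print(mask.sum(), "pixels flagged")
```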
https://arxiv.org/abs/2501.09129
Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.
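The abstract does not publish the decision rule; as a purely illustrative sketch, the two predictors could be used in tandem roughly as follows, with all probabilities and thresholds hypothetical:

```python
def turn_taking_decision(p_turn_end_text, p_user_continues_audio,
                         prepare_thresh=0.4, take_thresh=0.7):
    """Combine a TurnGPT-style text-based turn-end probability with a
    VAP-style audio-based continuation probability into a robot action.
    All thresholds here are illustrative, not from the paper."""
    if p_turn_end_text > prepare_thresh:
        # Likely approaching a turn end: start generating a response early.
        if p_turn_end_text > take_thresh and p_user_continues_audio < 0.3:
            return "take_turn"       # speak now
        return "prepare_response"    # generate but hold
    if p_user_continues_audio > 0.8:
        return "keep_listening"      # user is clearly mid-turn
    return "wait"

print(turn_taking_decision(0.85, 0.1))  # -> take_turn
```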
https://arxiv.org/abs/2501.08946
Analyzing large-scale datasets, especially involving complex and high-dimensional data like images, is particularly challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabelled data, it typically focuses on flat, non-hierarchical structures, missing the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these limitations, we envision $\texttt{InfoHier}$, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC's ability to capture complex patterns. Simultaneously, it integrates an HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. $\texttt{InfoHier}$ has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.
https://arxiv.org/abs/2501.08717
Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at this https URL.
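Reading the abstract literally, the transformation representation is derived from an image pair and must be image-invariant, while a predictor uses it to act equivariantly in embedding space. A minimal sketch under that reading (architecture, dimensions, and losses are all assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformationEncoder(nn.Module):
    """Maps a (before, after) embedding pair to a transformation vector."""
    def __init__(self, dim=128, t_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                 nn.Linear(256, t_dim))

    def forward(self, z1, z2):
        return self.net(torch.cat([z1, z2], dim=1))

dim, t_dim, B = 128, 64, 16
t_enc = TransformationEncoder(dim, t_dim)
predictor = nn.Sequential(nn.Linear(dim + t_dim, 256), nn.ReLU(),
                          nn.Linear(256, dim))

# z_a1/z_a2: embeddings of image A before/after some transformation;
# z_b1/z_b2: embeddings of image B under the *same* transformation.
z_a1, z_a2, z_b1, z_b2 = (torch.randn(B, dim) for _ in range(4))

t_a = t_enc(z_a1, z_a2)
t_b = t_enc(z_b1, z_b2)
# Image-invariance: the same transformation should map to the same vector,
# regardless of which image it was applied to.
invariance_loss = F.mse_loss(t_a, t_b)
# Equivariance: the transformation vector should carry enough information
# to predict the transformed embedding from the untransformed one.
equivariance_loss = F.mse_loss(predictor(torch.cat([z_b1, t_a], dim=1)), z_b2)
loss = invariance_loss + equivariance_loss
```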
https://arxiv.org/abs/2501.08712
While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning, respectively). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we introduce MAGNET, an adaptation of decoder-only LLMs that enhances their ability to generate robust representations and infill missing text spans, while preserving their knowledge and text generation capabilities. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging future context, (3) retain the ability for open-ended text generation without exhibiting the repetition problem, and (4) preserve the knowledge gained by the LLM during pretraining.
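MAGNET's exact attention scheme is the paper's; a common way to build such a hybrid mask, assumed here for illustration, is to let a designated span attend bidirectionally while the remainder stays causal:

```python
import torch

def hybrid_attention_mask(seq_len, bidir_until):
    """Boolean attention mask (True = may attend): causal everywhere, except
    that positions before `bidir_until` attend to each other freely. The
    fixed split point is an illustrative stand-in for MAGNET's scheme."""
    causal = torch.ones(seq_len, seq_len).tril().bool()
    bidir = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    bidir[:bidir_until, :bidir_until] = True   # full attention in the prefix
    return causal | bidir

mask = hybrid_attention_mask(seq_len=8, bidir_until=4)
print(mask.int())
# Rows 0-3 see all of positions 0-3 (bidirectional prefix);
# rows 4-7 remain strictly causal.
```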
https://arxiv.org/abs/2501.08648
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses labeled virtual data to train a model and adapts it to unlabeled real data. Some recent works apply contrastive learning, a powerful self-supervised learning method, to this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at this https URL
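One generic form of pseudo-label guided pixel contrast, not necessarily PGPC's exact loss, builds class prototypes from high-confidence pseudo-labeled pixels and pulls each confident pixel toward its prototype:

```python
import torch
import torch.nn.functional as F

def pixel_prototype_contrast(pix_feats, pseudo_labels, confidence,
                             n_classes, conf_thresh=0.9, temperature=0.1):
    """Pull each confident pixel toward its class prototype (mean feature of
    confident pixels of that class) and away from other prototypes."""
    keep = confidence > conf_thresh
    feats = F.normalize(pix_feats[keep], dim=1)     # (N, D)
    labels = pseudo_labels[keep]                    # (N,)

    protos = torch.zeros(n_classes, feats.size(1), device=feats.device)
    for c in range(n_classes):
        members = feats[labels == c]
        if len(members) > 0:
            protos[c] = members.mean(0)
    protos = F.normalize(protos, dim=1)

    logits = feats @ protos.T / temperature         # (N, C)
    return F.cross_entropy(logits, labels)

pix_feats = torch.randn(1000, 64)                   # flattened pixel features
pseudo = torch.randint(0, 19, (1000,))              # 19 Cityscapes classes
conf = torch.rand(1000)
loss = pixel_prototype_contrast(pix_feats, pseudo, conf, n_classes=19)
```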
https://arxiv.org/abs/2501.09040
Hyperspectral images are typically composed of hundreds of narrow and contiguous spectral bands, each containing information regarding the material composition of the imaged scene. However, these images can be affected by various sources of noise, distortions, or data loss, which can significantly degrade their quality and usefulness. This paper introduces a convergence-guaranteed algorithm, LRS-PnP-DIP(1-Lip), which successfully addresses the previously reported instability issue of DHP. The proposed algorithm extends the successful joint low-rank and sparse model to further exploit the underlying data structures beyond the conventional and sometimes restrictive union-of-subspaces models. A stability analysis guarantees the convergence of the proposed algorithm under mild assumptions, which is crucial for its application in real-world scenarios. Extensive experiments demonstrate that the proposed solution consistently delivers visually and quantitatively superior inpainting results, establishing state-of-the-art performance.
https://arxiv.org/abs/2501.08195
The development of EEG decoding algorithms confronts challenges such as data sparsity, subject variability, and the need for precise annotations, all of which are vital for advancing brain-computer interfaces and enhancing the diagnosis of diseases. To address these issues, we propose a novel two-stage approach named Self-Supervised State Reconstruction-Primed Riemannian Dynamics (EEG-ReMinD), which mitigates reliance on supervised learning and integrates inherent geometric features. This approach efficiently handles EEG data corruptions and reduces the dependency on labels. EEG-ReMinD utilizes self-supervised and geometric learning techniques, along with an attention mechanism, to analyze the temporal dynamics of EEG features within the framework of Riemannian geometry, referred to as Riemannian dynamics. Comparative analyses on both intact and corrupted datasets from two different neurodegenerative disorders underscore the enhanced performance of EEG-ReMinD.
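Riemannian-geometry pipelines for EEG typically operate on symmetric positive definite (SPD) covariance matrices of signal windows. A minimal sketch of such features and a log-Euclidean distance between them follows; the paper's state reconstruction and attention stages are omitted, and the log-Euclidean metric is one common choice, not necessarily the paper's:

```python
import numpy as np

def spd_covariance(window, eps=1e-6):
    """Channel covariance of an EEG window (channels x samples), regularized
    to be symmetric positive definite."""
    c = np.cov(window)
    return c + eps * np.eye(c.shape[0])

def spd_log(m):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.T

def log_euclidean_distance(a, b):
    """Log-Euclidean distance: Frobenius norm of the difference of logs."""
    return np.linalg.norm(spd_log(a) - spd_log(b), ord='fro')

eeg_t0 = np.random.randn(8, 250)   # 8 channels, 1 s at 250 Hz
eeg_t1 = np.random.randn(8, 250)
d = log_euclidean_distance(spd_covariance(eeg_t0), spd_covariance(eeg_t1))
print(f"log-Euclidean distance between the two states: {d:.3f}")
```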
https://arxiv.org/abs/2501.08139
This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 terapixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1 m spatial-resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset in Parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce EarthMAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.
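The core MAE pretraining step is standard: mask most patches, encode only the visible ones, and score reconstruction on the masked ones. A stripped-down sketch (EarthMAE's multimodal specifics are omitted, and the decoder is a stand-in):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Split a patch sequence into visible and masked parts, as in MAE.
    patches: (B, N, D) -> visible (B, N_keep, D), plus index bookkeeping."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    ids_shuffle = noise.argsort(dim=1)        # random permutation per item
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep[..., None].expand(-1, -1, D))
    ids_masked = ids_shuffle[:, n_keep:]
    return visible, ids_keep, ids_masked

def masked_reconstruction_loss(pred, target, ids_masked):
    """MSE computed on the masked patches only."""
    D = target.size(-1)
    tgt = torch.gather(target, 1, ids_masked[..., None].expand(-1, -1, D))
    prd = torch.gather(pred, 1, ids_masked[..., None].expand(-1, -1, D))
    return ((prd - tgt) ** 2).mean()

patches = torch.randn(4, 196, 768)            # 14x14 patches, ViT-Base dim
visible, ids_keep, ids_masked = random_masking(patches)
# The encoder runs on `visible` only; the decoder predicts all N patches.
pred = torch.randn(4, 196, 768)               # stand-in decoder output
loss = masked_reconstruction_loss(pred, patches, ids_masked)
```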
https://arxiv.org/abs/2501.08111
Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MUSTC dataset.
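The gradient-sensitive gating network and multi-stage dropout are the paper's contributions; the sketch below shows only the plain gated-fusion core that such a network generalizes (dimensions are assumptions):

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Fuse SSL features and spectral (FBank) features with a learned,
    per-frame gate. The paper's gradient-sensitive gating and multi-stage
    dropout are omitted; this is the plain conditional-computation core."""
    def __init__(self, ssl_dim=768, fbank_dim=80, d_model=256):
        super().__init__()
        self.proj_ssl = nn.Linear(ssl_dim, d_model)
        self.proj_fbank = nn.Linear(fbank_dim, d_model)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model),
                                  nn.Sigmoid())

    def forward(self, ssl_feats, fbank_feats):
        a = self.proj_ssl(ssl_feats)         # (B, T, d_model)
        b = self.proj_fbank(fbank_feats)     # (B, T, d_model)
        g = self.gate(torch.cat([a, b], dim=-1))
        return g * a + (1 - g) * b           # per-dimension convex mix

fusion = GatedFeatureFusion()
out = fusion(torch.randn(2, 100, 768), torch.randn(2, 100, 80))
print(out.shape)   # torch.Size([2, 100, 256])
```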
https://arxiv.org/abs/2501.08057
The steadily increasing use of data-driven methods in areas that handle sensitive personal information, such as law enforcement, demands an ever-increasing effort from these institutions to comply with data protection guidelines. In this work, we present a system for automatically anonymizing images of scanned documents, reducing manual effort while ensuring data protection compliance. Our method preserves the viability of further forensic processing after anonymization by minimizing the automatically redacted area, combining automatic detection of sensitive regions with knowledge from a manually anonymized reference document. Using a self-supervised image model for instance retrieval of the reference document, our approach requires only one anonymized example to efficiently redact all documents of the same type, significantly reducing processing time. We show that our approach outperforms both a purely automatic redaction system and a naive scheme that copy-pastes the reference anonymization onto other documents, evaluated on a hand-crafted dataset of ground-truth redactions.
https://arxiv.org/abs/2501.07334
We present a novel approach for depth estimation from images captured by structured light systems. Unlike many previous methods that rely on an image-matching process, our approach uses a density voxel grid to represent scene geometry, which is trained via self-supervised differentiable volume rendering. Our method leverages color fields derived from projected patterns in structured light systems during the rendering process, enabling isolated optimization of the geometry field. This contributes to faster convergence and high-quality output. Additionally, we incorporate normalized device coordinates (NDC), a distortion loss, and a novel surface-based color loss to enhance geometric fidelity. Experimental results demonstrate that our method outperforms existing matching-based techniques in geometric performance for few-shot scenarios, achieving approximately a 60% reduction in average estimated depth errors on synthetic scenes and about 30% on real-world captured scenes. Furthermore, our approach delivers fast training, roughly three times faster than previous matching-free methods that employ implicit representations.
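The rendering backbone is standard differentiable volume rendering, which turns per-sample densities along a ray into an expected depth; the structured-light color field and the paper's losses are omitted in this sketch:

```python
import torch

def render_depth(sigmas, t_vals):
    """Expected ray depth from sampled densities, as in differentiable
    volume rendering: w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    depth = sum_i w_i * t_i."""
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], -1)
    alpha = 1.0 - torch.exp(-sigmas * deltas)           # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)  # transmittance
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], -1)
    weights = trans * alpha
    return (weights * t_vals).sum(-1)                   # expected depth

n_rays, n_samples = 1024, 64
t_vals = torch.linspace(0.1, 4.0, n_samples).expand(n_rays, n_samples)
sigmas = torch.relu(torch.randn(n_rays, n_samples))    # densities from grid
depth = render_depth(sigmas, t_vals)
print(depth.shape)   # torch.Size([1024])
```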
https://arxiv.org/abs/2501.07113
Source-free domain adaptation (SFDA) utilizes a pre-trained source model with unlabeled target data. Self-supervised SFDA techniques generate pseudolabels from the pre-trained source model, but these pseudolabels often contain noise due to domain discrepancies between the source and target domains. Traditional self-supervised SFDA techniques rely on deterministic model predictions using the softmax function, leading to unreliable pseudolabels. In this work, we propose to introduce predictive uncertainty and softmax calibration for pseudolabel refinement using evidential deep learning. The Dirichlet prior is placed over the output of the target network to capture uncertainty using evidence with a single forward pass. Furthermore, softmax calibration solves the translation invariance problem to assist in learning with noisy labels. We incorporate a combination of evidential deep learning loss and information maximization loss with calibrated softmax in both prior and non-prior target knowledge SFDA settings. Extensive experimental analysis shows that our method outperforms other state-of-the-art methods on benchmark datasets.
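The evidential head itself has a standard closed form: non-negative evidence defines Dirichlet parameters, and predictive uncertainty follows in a single forward pass. A generic sketch (not the paper's full loss; the filtering threshold is an assumption):

```python
import torch
import torch.nn.functional as F

def dirichlet_from_logits(logits):
    """Evidential deep learning head: evidence = softplus(logits) >= 0,
    Dirichlet parameters alpha = evidence + 1, all in one forward pass."""
    evidence = F.softplus(logits)
    alpha = evidence + 1.0
    strength = alpha.sum(dim=1, keepdim=True)   # Dirichlet strength S
    probs = alpha / strength                    # expected class probability
    k = logits.size(1)
    uncertainty = k / strength.squeeze(1)       # u = K / S, in (0, 1]
    return probs, uncertainty

logits = torch.randn(8, 31)                     # e.g., 31 Office-31 classes
probs, u = dirichlet_from_logits(logits)
# Pseudolabels can then be kept only where uncertainty is low:
keep = u < 0.5
print(probs.argmax(1)[keep], u[keep])
```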
https://arxiv.org/abs/2501.07072
This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages, emphasizing effective source language selection. Previous cross-lingual research has used various source languages to enhance performance for the target low-resource language without thorough consideration of selection. Our study stands out by providing an in-depth analysis of language selection, supported by a practical approach to assess phonetic proximity among multiple language families. We investigate how within-family similarity impacts performance in multilingual training, which aids in understanding language dynamics. We also evaluate the effect of using phonologically similar languages, regardless of family. For the phoneme recognition task, utilizing phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training, even surpassing the performance of a large-scale self-supervised learning model. Multilingual training within the same language family demonstrates that higher phonological similarity enhances performance, while lower similarity results in degraded performance compared to monolingual training.
https://arxiv.org/abs/2501.06810
Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
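The key trick, supervising the composed embedding with the target caption's text embedding instead of the target image embedding, reduces to a contrastive objective like the sketch below (the composition network and all names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionNet(nn.Module):
    """Fuse a query-image embedding with a modification-text embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        return self.fuse(torch.cat([img_emb, txt_emb], dim=1))

def scot_style_loss(composed, target_text_emb, temperature=0.07):
    """InfoNCE where the *text* embedding of the target caption acts as the
    proxy target, replacing the target image embedding."""
    z = F.normalize(composed, dim=1)
    t = F.normalize(target_text_emb, dim=1)
    logits = z @ t.T / temperature
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)

B, dim = 32, 512
net = CompositionNet(dim)
query_img, mod_text, target_text = (torch.randn(B, dim) for _ in range(3))
loss = scot_style_loss(net(query_img, mod_text), target_text)
```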
https://arxiv.org/abs/2501.08347
This paper tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet. The key challenge of this task is learning a robust 3D-aware representation without multi-view or dynamic data, while ensuring consistent texture and geometry across different viewpoints. Although some baseline methods are capable of 3D-aware generation, the quality of the generated images still lags behind state-of-the-art 2D generation approaches, which excel in producing high-quality, detailed images. To address this severe limitation, we propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting, coined F3D-Gaus, which can produce more realistic and reliable 3D renderings from monocular inputs. In addition, we introduce a self-supervised cycle-consistent constraint to enforce cross-view consistency in the learned 3D representation. This training strategy naturally allows aggregation of multiple aligned Gaussian primitives and significantly alleviates the interpolation limitations inherent in single-view pixel-aligned Gaussian Splatting. Furthermore, we incorporate video model priors to perform geometry-aware refinement, enhancing the generation of fine details in wide-viewpoint scenarios and improving the model's capability to capture intricate 3D textures. Extensive experiments demonstrate that our approach not only achieves high-quality, multi-view consistent 3D-aware generation from monocular datasets, but also significantly improves training and inference efficiency.
https://arxiv.org/abs/2501.06714