Online object segmentation and tracking in Lidar point clouds enables autonomous agents to understand their surroundings and make safe decisions. Unfortunately, manual annotations for these tasks are prohibitively costly. We tackle this problem with the task of class-agnostic unsupervised online instance segmentation and tracking. To that end, we leverage an instance segmentation backbone and propose a new training recipe that enables the online tracking of objects. Our network is trained on pseudo-labels, eliminating the need for manual annotations. We conduct an evaluation using metrics adapted for temporal instance segmentation. Computing these metrics requires temporally-consistent instance labels. When unavailable, we construct these labels using the available 3D bounding boxes and semantic labels in the dataset. We compare our method against strong baselines and demonstrate its superiority across two different outdoor Lidar datasets.
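The label-construction step for evaluation is mechanical enough to sketch: every Lidar point that falls inside an annotated 3D box inherits that box's dataset-provided tracking id, yielding temporally-consistent instance labels. A minimal NumPy version, assuming a (center, size, yaw) box parametrization rather than the paper's actual code:

```python
import numpy as np

def boxes_to_instance_labels(points, boxes, track_ids):
    """Assign a temporally-consistent instance id to every Lidar point.

    points   : (N, 3) xyz coordinates of one scan
    boxes    : (M, 7) boxes as (cx, cy, cz, dx, dy, dz, yaw)
    track_ids: (M,) positive tracking ids, stable across frames
    Returns  : (N,) per-point instance ids, 0 for background
    """
    labels = np.zeros(len(points), dtype=np.int64)
    for (cx, cy, cz, dx, dy, dz, yaw), tid in zip(boxes, track_ids):
        local = points - np.array([cx, cy, cz])
        # Rotate into the box frame (rotation by -yaw around z).
        c, s = np.cos(-yaw), np.sin(-yaw)
        x = c * local[:, 0] - s * local[:, 1]
        y = s * local[:, 0] + c * local[:, 1]
        inside = (np.abs(x) <= dx / 2) & (np.abs(y) <= dy / 2) \
                 & (np.abs(local[:, 2]) <= dz / 2)
        labels[inside] = tid
    return labels
```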
https://arxiv.org/abs/2409.07887
Retinal fundus photography offers a non-invasive way to diagnose and monitor a variety of retinal diseases, but is prone to inherent quality glitches arising from systemic imperfections or operator/patient-related factors. However, high-quality retinal images are crucial for carrying out accurate diagnoses and automated analyses. Fundus image enhancement is typically formulated as a distribution alignment problem: finding a one-to-one mapping between a low-quality image and its high-quality counterpart. This paper proposes a context-informed optimal transport (OT) learning framework for tackling unpaired fundus image enhancement. In contrast to standard generative image enhancement methods, which struggle to handle contextual information (e.g., over-tampered local structures and unwanted artifacts), the proposed context-aware OT learning paradigm better preserves local structures and minimizes unwanted artifacts. Leveraging deep contextual features, we derive the proposed context-aware OT using the earth mover's distance and show that the proposed context-OT has a solid theoretical guarantee. Experimental results on a large-scale dataset demonstrate the superiority of the proposed method over several state-of-the-art supervised and unsupervised methods in terms of signal-to-noise ratio, structural similarity index, as well as two downstream tasks. The code is available at \url{this https URL}.
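The authors' context-aware formulation is not spelled out in the abstract, but the quantity it builds on, an earth mover's distance over deep contextual features, can be approximated with a generic entropic-regularized OT (Sinkhorn) solver. A self-contained sketch with hypothetical patch features:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=200):
    """Entropy-regularized OT between histograms a, b with cost matrix C."""
    K = np.exp(-C / eps)                       # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # transport plan
    return (P * C).sum()                       # approximate EMD

# Hypothetical (n, d) contextual features of low-/high-quality patches.
rng = np.random.default_rng(0)
feat_lq, feat_hq = rng.normal(size=(32, 64)), rng.normal(size=(48, 64))
C = np.linalg.norm(feat_lq[:, None] - feat_hq[None, :], axis=-1)
C /= C.max()                                   # normalize to avoid underflow
a, b = np.full(32, 1 / 32), np.full(48, 1 / 48)
print(sinkhorn(a, b, C))
```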
https://arxiv.org/abs/2409.07862
For traffic incident detection, the acquisition of data and labels is notably resource-intensive, rendering semi-supervised traffic incident detection both a formidable and consequential challenge. Thus, this paper approaches traffic incident detection in a semi-supervised learning setting and proposes a semi-supervised model named FPMT within the framework of MixText. The data augmentation module introduces Generative Adversarial Networks to balance and expand the dataset. During the mix-up process in the hidden space, it employs a probabilistic pseudo-mixing mechanism to enhance regularization and elevate model precision. In terms of training strategy, it initiates with unsupervised training on all data, followed by supervised fine-tuning on a subset of labeled data, ultimately completing the semi-supervised training objective. Through empirical validation on four authentic datasets, our FPMT model exhibits outstanding performance across various metrics. Particularly noteworthy is its robust performance even in scenarios with low label rates.
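The probabilistic pseudo-mixing mechanism itself is not fully specified in the abstract; for orientation, the plain MixText-style hidden-space mix-up it extends looks roughly like this (shapes and the Beta prior are conventional choices, not FPMT's exact recipe):

```python
import torch

def hidden_mixup(h_a, h_b, y_a, y_b, alpha=0.4):
    """Mix two batches of hidden states (and their labels) TMix-style.

    h_a, h_b: (batch, seq, dim) hidden states from the same encoder layer
    y_a, y_b: (batch, n_classes) one-hot or soft label distributions
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)            # keep the mix biased toward h_a
    h_mix = lam * h_a + (1 - lam) * h_b
    y_mix = lam * y_a + (1 - lam) * y_b
    return h_mix, y_mix
```

The mixed hidden states are then passed through the remaining encoder layers and trained against the mixed labels.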
https://arxiv.org/abs/2409.07839
Surgical scenes convey crucial information about the quality of surgery. Pixel-wise localization of tools and anatomical structures is the first task towards deeper surgical analysis for microscopic or endoscopic surgical views. This is typically done via fully-supervised methods which are annotation-hungry and, in several cases, demand medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation-efficient framework for the semantic segmentation of surgical scenes. We employ image-based self-supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine-tuning step. Our unsupervised setup, reinforced with only 36 annotation labels, achieves localization performance comparable to fully-supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to $\sim 2\%$ improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.
https://arxiv.org/abs/2409.07801
Depth estimation is a cornerstone of 3D reconstruction and plays a vital role in minimally invasive endoscopic surgeries. However, most current depth estimation networks rely on traditional convolutional neural networks, which are limited in their ability to capture global information. Foundation models offer a promising avenue for enhancing depth estimation, but those currently available are primarily trained on natural images, leading to suboptimal performance when applied to endoscopic images. In this work, we introduce a novel fine-tuning strategy for the Depth Anything Model and integrate it with an intrinsic-based unsupervised monocular depth estimation framework. Our approach includes a low-rank adaptation technique based on random vectors, which improves the model's adaptability to different scales. Additionally, we propose a residual block built on depthwise separable convolution to compensate for the transformer's limited ability to capture high-frequency details, such as edges and textures. Our experimental results on the SCARED dataset show that our method achieves state-of-the-art performance while minimizing the number of trainable parameters. Applying this method in minimally invasive endoscopic surgery could significantly enhance both the precision and safety of these procedures.
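The "low-rank adaptation technique based on random vectors" reads like a VeRA-style adapter, where the low-rank factors are frozen random matrices and only small per-dimension scaling vectors are trained. A sketch under that assumption, not necessarily the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class RandomVectorAdapter(nn.Module):
    """Adapter with frozen random low-rank factors A, B and trainable
    scaling vectors d, b (a VeRA-style reading; an assumption here)."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base.requires_grad_(False)      # frozen pretrained layer
        self.register_buffer("A", torch.randn(rank, base.in_features) / rank**0.5)
        self.register_buffer("B", torch.randn(base.out_features, rank) / rank**0.5)
        self.d = nn.Parameter(torch.ones(rank))                # trainable
        self.b = nn.Parameter(torch.zeros(base.out_features))  # trainable

    def forward(self, x):
        h = (x @ self.A.T) * self.d                    # project down, scale
        return self.base(x) + (h @ self.B.T) * self.b  # delta is zero at init
```

Because only `d` and `b` are trained, the number of trainable parameters stays tiny regardless of the base layer's size.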
https://arxiv.org/abs/2409.07723
Dialogue topic segmentation (DTS) plays a crucial role in various types of dialogue modeling tasks. State-of-the-art unsupervised DTS methods learn topic-aware discourse representations from conversation data through adjacent discourse matching and pseudo segmentation, to further mine useful clues in unlabeled conversational relations. However, in multi-round dialogs, discourses often contain co-references or omissions, so directly using these discourses for representation learning may negatively affect the semantic similarity computation in the adjacent discourse matching task. In order to fully utilize the useful cues in conversational relations, this study proposes a novel unsupervised dialog topic segmentation method that combines the Utterance Rewriting (UR) technique with an unsupervised learning algorithm, efficiently exploiting the useful cues in unlabeled dialogs by rewriting them to recover co-referents and omitted words. Compared with existing unsupervised models, the proposed Utterance Rewriting Topic Segmentation model (UR-DTS) significantly improves the accuracy of topic segmentation. The main finding is that performance on DialSeg711 improves by about 6% in terms of absolute error score and WD, reaching 11.42% in terms of absolute error score and 12.97% in terms of WD. On Doc2Dial, the absolute error score and WD improve by about 3% and 2%, respectively, resulting in a new SOTA of 35.17% in terms of absolute error score and 38.49% in terms of WD. This shows that the model is very effective in capturing the nuances of conversational topics, and illustrates both the usefulness and the challenges of utilizing unlabeled conversations.
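For reference, the WD (WindowDiff) figures quoted above compare predicted and reference boundary sequences within a sliding window; NLTK ships an implementation. The boundary strings below are hypothetical:

```python
from nltk.metrics.segmentation import windowdiff

# '1' marks a topic boundary after the corresponding utterance.
reference  = "0010000100"
hypothesis = "0100000100"
k = 3  # window size, often half the mean reference segment length
print(windowdiff(reference, hypothesis, k))   # lower is better
```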
https://arxiv.org/abs/2409.07672
With the advent of billion-parameter foundation models, efficient fine-tuning has become increasingly important for the adaptation of models to downstream tasks. However, especially in computer vision, it can be hard to achieve good performance when access to quality labeled data is lacking. In this work, we propose a method adapting pretrained generalist models in a self-supervised manner by learning binary masks. These self-supervised masking networks (SMNs) are up to 79x more efficient to store and significantly improve performance on label-efficient downstream tasks. We validate the usefulness of learning binary masks as a fine-tuning method on 8 datasets and 3 model architectures, and we demonstrate the effectiveness of SMNs in 3 label-efficient settings.
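A common way to learn a binary mask over frozen weights is to train real-valued scores and binarize them with a straight-through estimator; the sketch below shows that generic recipe (an assumption, as the paper's exact parameterization may differ). Storing one bit per weight instead of a full fine-tuned copy is what yields the large storage savings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Frozen pretrained linear layer adapted only through a learned
    binary weight mask (straight-through estimator)."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.register_buffer("weight", base.weight.detach().clone())
        self.register_buffer("bias", base.bias.detach().clone()
                             if base.bias is not None else None)
        # Positive init => the mask starts as all-ones (no change).
        self.scores = nn.Parameter(torch.full_like(self.weight, 0.01))

    def forward(self, x):
        hard = (self.scores > 0).float()
        soft = torch.sigmoid(self.scores)
        mask = hard + soft - soft.detach()    # binary forward, soft gradient
        return F.linear(x, self.weight * mask, self.bias)
```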
https://arxiv.org/abs/2409.07577
Rigid point cloud registration is a fundamental problem and highly relevant in robotics and autonomous driving. Nowadays, deep learning methods can be trained to match a pair of point clouds, given the ground-truth transformation between them. However, this training is often not scalable due to the high cost of collecting ground truth poses. Therefore, we present a self-distillation approach to learn point cloud registration in an unsupervised fashion. Here, each sample is passed to a teacher network and an augmented view is passed to a student network. The teacher includes a trainable feature extractor and a learning-free robust solver such as RANSAC. The solver forces consistency among correspondences and optimizes for the unsupervised inlier ratio, eliminating the need for ground truth labels. Our approach simplifies the training procedure by removing the need for initial hand-crafted features or consecutive point cloud frames as seen in related methods. We show that our method not only surpasses them on the RGB-D benchmark 3DMatch but also generalizes well to automotive radar, where classical features adopted by others fail. The code is available at this https URL .
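The teacher's learning-free solver can be as small as a few lines of RANSAC over putative correspondences, scoring each pose hypothesis by the inlier count that the unsupervised objective optimizes. A minimal NumPy sketch (a production pipeline would use an optimized solver such as Open3D's):

```python
import numpy as np

def kabsch(P, Q):
    """Best-fit rotation R and translation t with Q ~ R @ P + t."""
    cP, cQ = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cQ - R @ cP

def ransac_registration(src, dst, n_iter=500, thresh=0.05, seed=0):
    """src[i] <-> dst[i] are putative matches from the feature extractor."""
    rng = np.random.default_rng(seed)
    best = (None, None, -1)
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        n_inliers = (np.linalg.norm(src @ R.T + t - dst, axis=1) < thresh).sum()
        if n_inliers > best[2]:
            best = (R, t, n_inliers)
    return best   # pose with the highest (unsupervised) inlier count
```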
https://arxiv.org/abs/2409.07558
Self-training methods have proven to be effective in exploiting abundant unlabeled data in semi-supervised learning, particularly when labeled data is scarce. While many of these approaches rely on a cross-entropy loss function (CE), recent advances have shown that the supervised contrastive loss function (SupCon) can be more effective. Additionally, unsupervised contrastive learning approaches have also been shown to capture high quality data representations in the unsupervised setting. To benefit from these advantages in a semi-supervised setting, we propose a general framework to enhance self-training methods, which replaces all instances of CE losses with a unique contrastive loss. By using class prototypes, which are a set of class-wise trainable parameters, we recover the probability distributions of the CE setting and show a theoretical equivalence with it. Our framework, when applied to popular self-training methods, results in significant performance improvements across three different datasets with a limited number of labeled data. Additionally, we demonstrate further improvements in convergence speed, transfer ability, and hyperparameter stability. The code is available at \url{this https URL}.
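The prototype construction admits a compact sketch: treating scaled similarities between normalized embeddings and trainable class prototypes as logits recovers the cross-entropy probabilities, which is the theoretical equivalence the framework rests on. Under those assumptions:

```python
import torch
import torch.nn.functional as F

def prototype_loss(z, labels, prototypes, tau=0.1):
    """Contrastive loss of embeddings against class prototypes.

    z          : (batch, dim) sample embeddings
    prototypes : (n_classes, dim) trainable class-wise parameters,
                 e.g. nn.Parameter(torch.randn(n_classes, dim))
    """
    logits = F.normalize(z, dim=1) @ F.normalize(prototypes, dim=1).T / tau
    return F.cross_entropy(logits, labels)   # reduces to the CE setting
```

In a self-training loop, the same loss is applied to pseudo-labeled samples, replacing every CE instance as described above.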
https://arxiv.org/abs/2409.07292
Recent progress in semantic point cloud analysis is largely driven by synthetic data (e.g., ModelNet and ShapeNet), which are typically complete, well-aligned, and noise-free. Therefore, representations of these ideal synthetic point clouds have limited geometric variation and can achieve good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain-invariant geometric patterns from incomplete and noisy point clouds. To address this problem, we introduce a novel scheme for inducing geometric invariance of point cloud representations across domains, by regularizing representation learning with two self-supervised geometric augmentation tasks. On the one hand, a novel pretext task of predicting translation distances of augmented samples is proposed to alleviate the centroid shift of point clouds due to occlusion and noise. On the other hand, we pioneer a cascaded integration of relational self-supervised learning on geometrically-augmented point clouds, utilizing the intrinsic relationship between augmented variants and other samples as extra constraints on cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance.
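The first pretext task is easy to make concrete: translate each cloud by a random offset and regress the magnitude of that offset, forcing the encoder to be robust to centroid shift. A minimal batch-construction sketch (shapes and ranges are illustrative):

```python
import torch

def translation_pretext_batch(clouds, max_shift=0.5):
    """Build inputs and regression targets for translation-distance
    prediction.

    clouds: (batch, n_points, 3) point clouds, assumed roughly centered
    """
    shift = (torch.rand(clouds.size(0), 1, 3) * 2 - 1) * max_shift
    augmented = clouds + shift                 # rigid translation
    target = shift.squeeze(1).norm(dim=-1)     # distance to regress
    return augmented, target
```

A small regression head on the encoder's global feature is then trained with a mean-squared error against `target`.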
https://arxiv.org/abs/2409.06956
Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While artificial intelligence (AI) foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduced a pretraining strategy that combines self-supervised reconstructions, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Through evaluation using 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially showcasing few-shot, even zero-shot capabilities in real-world long-tail scenarios.
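Of the three pretraining objectives, the image-text contrastive term is the most standard; a CLIP-style symmetric formulation (assumed here; the paper's details may differ) looks like:

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(img_emb, txt_emb, tau=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.T / tau                 # (batch, batch) similarities
    targets = torch.arange(len(img), device=img.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```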
https://arxiv.org/abs/2409.06644
Image analysis in Euclidean space through linear hyperspaces is well studied. However, in the quest for more effective image representations, we turn to hyperbolic manifolds. They provide a compelling alternative for capturing complex hierarchical relationships in images with remarkably small dimensionality. To demonstrate the competence of hyperbolic embeddings, we introduce a light-weight hyperbolic graph neural network for image segmentation, encompassing patch-level features in a very small embedding size. Our solution, Seg-HGNN, surpasses the current best unsupervised method by 2.5\%, 4\% on VOC-07, VOC-12 for localization, and by 0.8\%, 1.3\% on CUB-200, ECSSD for segmentation, respectively. With less than 7.5k trainable parameters, Seg-HGNN delivers effective and fast ($\approx 2$ images/second) results on very standard GPUs like the GTX1650. This empirical evaluation presents compelling evidence of the efficacy and potential of hyperbolic representations for vision tasks.
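For concreteness, on the Poincaré ball (unit curvature assumed here) the distance between two embeddings $\mathbf{x}$ and $\mathbf{y}$ is
$$ d(\mathbf{x},\mathbf{y}) = \operatorname{arcosh}\left(1 + \frac{2\,\lVert \mathbf{x}-\mathbf{y}\rVert^2}{\left(1-\lVert \mathbf{x}\rVert^2\right)\left(1-\lVert \mathbf{y}\rVert^2\right)}\right), $$
which grows rapidly as points approach the boundary; this exponentially expanding volume is what lets very low-dimensional hyperbolic embeddings encode deep hierarchies.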
https://arxiv.org/abs/2409.06589
Anomaly detection is a crucial process in industrial manufacturing and has made significant advancements recently. However, there is a large discrepancy between the data used during development and the data collected in production environments. Therefore, we present the Texture-AD benchmark, built around representative texture-based anomaly detection, to evaluate the effectiveness of unsupervised anomaly detection algorithms in real-world applications. The dataset includes images of 15 different cloths, 14 semiconductor wafers, and 10 metal plates acquired under different optical schemes. In addition, it includes more than 10 different types of defects produced during real manufacturing processes, such as scratches, wrinkles, color variations, and point defects, which are often more difficult to detect than those in existing datasets. All anomalous areas are provided with pixel-level annotations to facilitate comprehensive evaluation of anomaly detection models. Specifically, to adapt to the diverse products of automated pipelines, we present a new evaluation method along with results for baseline algorithms. The experimental results show that Texture-AD poses a difficult challenge for state-of-the-art algorithms. To our knowledge, Texture-AD is the first dataset devoted to evaluating industrial defect detection algorithms in the real world. The dataset is available at https://XXX.
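Given the pixel-level annotations, a standard way to score the baselines is pixel-wise AUROC over predicted anomaly maps; a minimal sketch with placeholder arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholders: binary ground-truth masks and predicted anomaly maps.
gt = np.random.randint(0, 2, size=(4, 64, 64))
score = np.random.rand(4, 64, 64)
print("pixel AUROC:", roc_auc_score(gt.ravel(), score.ravel()))
```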
https://arxiv.org/abs/2409.06367
Heterogeneous graph neural networks (HGNNs) have significantly propelled the information retrieval (IR) field. Still, the effectiveness of HGNNs heavily relies on high-quality labels, which are often expensive to acquire. This challenge has shifted attention towards Heterogeneous Graph Contrastive Learning (HGCL), which usually requires pre-defined meta-paths. However, our findings reveal that meta-path combinations significantly affect performance in unsupervised settings, an aspect often overlooked in current literature. Existing HGCL methods have considerable variability in outcomes across different meta-path combinations, thereby challenging the optimization process to achieve consistent and high performance. In response, we introduce \textsf{LAMP} (\underline{\textbf{L}}earn\underline{\textbf{A}}ble \underline{\textbf{M}}eta-\underline{\textbf{P}}ath), a novel adversarial contrastive learning approach that integrates various meta-path sub-graphs into a unified and stable structure, leveraging the overlap among these sub-graphs. To address the denseness of this integrated sub-graph, we propose an adversarial training strategy for edge pruning, maintaining sparsity to enhance model performance and robustness. \textsf{LAMP} aims to maximize the difference between meta-path and network schema views for guiding contrastive learning to capture the most meaningful information. Our extensive experimental study conducted on four diverse datasets from the Heterogeneous Graph Benchmark (HGB) demonstrates that \textsf{LAMP} significantly outperforms existing state-of-the-art unsupervised models in terms of accuracy and robustness.
https://arxiv.org/abs/2409.06323
Unsupervised anomaly detection (AD) aims to train robust detection models using only normal samples, while generalizing well to unseen anomalies. Recent research focuses on a unified unsupervised AD setting in which only one model is trained for all classes, i.e., the n-class-one-model paradigm. Feature-reconstruction-based methods achieve state-of-the-art performance in this scenario. However, existing methods often suffer from a lack of sufficient contextual awareness, thereby compromising the quality of the reconstruction. To address this issue, we introduce a novel Reconstruction as Sequence (RAS) method, which enhances the contextual correspondence during feature reconstruction from a sequence modeling perspective. In particular, based on the transformer technique, we integrate a specialized RASFormer block into RAS. This block enables the capture of spatial relationships among different image regions and enhances sequential dependencies throughout the reconstruction process. By incorporating the RASFormer block, our RAS method achieves superior contextual awareness capabilities, leading to remarkable performance. Experimental results show that our RAS significantly outperforms competing methods, well demonstrating the effectiveness and superiority of our method. Our code is available at this https URL.
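Feature-reconstruction methods typically turn the per-token reconstruction error into a pixel-level anomaly map; a generic sketch of that final scoring step (the square-token-grid layout is an assumption):

```python
import torch
import torch.nn.functional as F

def anomaly_map(feats, recon, image_hw):
    """Per-pixel anomaly scores from feature reconstruction error.

    feats, recon: (batch, n_tokens, dim) original and reconstructed patch
                  features; n_tokens is assumed to form a square grid
    image_hw    : (H, W) output resolution
    """
    err = 1 - F.cosine_similarity(feats, recon, dim=-1)   # (batch, n_tokens)
    side = int(err.size(1) ** 0.5)
    err = err.view(-1, 1, side, side)
    return F.interpolate(err, size=image_hw, mode="bilinear",
                         align_corners=False)
```

High values indicate regions the sequence model could not reconstruct, i.e., likely anomalies.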
https://arxiv.org/abs/2409.06285
Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a Gaussian prior. During inference, a model is designated as the source model to map the input audio to its corresponding Gaussian prior, and another model is designated as the target model to reconstruct the target audio from this Gaussian prior, thereby facilitating timbre transfer. We compare our approach against existing unsupervised timbre transfer models such as VAEGAN and Gaussian Flow Bridges (GFB). Experimental results demonstrate that our method achieves both better Fréchet Audio Distance (FAD) and melody preservation, as reflected by lower pitch distances (DPD) compared to VAEGAN and GFB. Additionally, we discover that the noise level from the Gaussian prior, $\sigma$, can be adjusted to control the degree of melody preservation and amount of timbre transferred.
https://arxiv.org/abs/2409.06096
Recent advances in language modelling have significantly decreased the need for labelled data in text classification tasks. Transformer-based models, pre-trained on unlabeled data, can outmatch the performance of models trained from scratch for each task. However, the amount of labelled data needed to fine-tune such models is still considerably high for domains requiring expert-level annotators, like the legal domain. This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data to perform a classification task in the legal area with 50 predefined topics. More specifically, we use the records of demands to a Brazilian Public Prosecutor's Office, aiming to assign each description to one of the subjects, which currently demands deep legal knowledge for manual filling. The task of optimizing classifier performance in this scenario is especially challenging, given the low amount of resources available for the Portuguese language, especially in the legal domain. Our results demonstrate that classic supervised models such as logistic regression and SVM, and the ensembles random forest and gradient boosting, achieve better performance with embeddings extracted with word2vec than with those from the BERT language model. The latter, however, demonstrates superior performance when its own architecture is used as the classifier, surpassing all the previous models in that regard. The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and semi-supervised learning strategies, with an accuracy of 80.7% on the aforementioned task.
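UDA's core objective fits in a few lines: supervised cross-entropy on the labeled subset plus a consistency term pulling predictions on augmented unlabeled texts toward the fixed predictions on their originals. A simplified sketch (omitting UDA's prediction sharpening and confidence masking):

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_lab, y_lab, x_unlab, x_unlab_aug, lam=1.0):
    sup = F.cross_entropy(model(x_lab), y_lab)     # supervised term
    with torch.no_grad():                          # fixed target distribution
        target = F.softmax(model(x_unlab), dim=1)
    log_q = F.log_softmax(model(x_unlab_aug), dim=1)
    consistency = F.kl_div(log_q, target, reduction="batchmean")
    return sup + lam * consistency
```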
https://arxiv.org/abs/2409.05972
Emerging universal Computational Aberration Correction (CAC) paradigms provide an inspiring solution to light-weight and high-quality imaging without repeated data preparation and model training to accommodate new lens designs. However, the training databases in these approaches, i.e., the lens libraries (LensLibs), suffer from limited coverage of real-world aberration behaviors. In this work, we set up an OmniLens framework for universal CAC, considering both generalization ability and flexibility. OmniLens extends the idea of universal CAC to a broader concept, where a base model is trained for three cases: zero-shot CAC with the pre-trained model, few-shot CAC with a little lens-specific data for fine-tuning, and domain-adaptive CAC, which uses domain adaptation for lenses whose descriptions are unknown. In terms of OmniLens's data foundation, we first propose an Evolution-based Automatic Optical Design (EAOD) pipeline to construct a LensLib automatically, coined AODLib, whose diversity is enriched by an evolution framework with comprehensive constraints and a hybrid optimization strategy for achieving realistic aberration behaviors. For network design, we introduce the guidance of high-quality codebook priors to facilitate zero-shot CAC and few-shot CAC, which enhances the model's generalization ability while also boosting its convergence in the few-shot case. Furthermore, based on the statistical observation of dark channel priors in optical degradation, we design an unsupervised regularization term that adapts the base model to a target lens with unknown descriptions using its aberration images, without ground truth. We validate OmniLens on 4 manually designed low-end lenses with various structures and aberration behaviors. Remarkably, the base model trained on AODLib exhibits strong generalization capabilities, achieving 97% of the lens-specific performance in a zero-shot setting.
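The unsupervised regularizer builds on the dark channel prior, which is well defined even though the paper's exact penalty is not given here: the dark channel of a clean image is close to zero, so its magnitude on a restored output is a label-free degradation signal that can be minimized. A sketch of the statistic:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """Dark channel of an RGB image in [0, 1]: the minimum over color
    channels and a local spatial window."""
    per_pixel_min = img.min(axis=2)                    # min over channels
    return minimum_filter(per_pixel_min, size=patch)   # min over window

# e.g. penalty = dark_channel(restored).mean() as an unsupervised loss term
```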
https://arxiv.org/abs/2409.05809
To extract robust and generalizable skeleton action recognition features, large amounts of well-curated data are typically required, which is a challenging task hindered by annotation and computation costs. Therefore, unsupervised representation learning is of prime importance to leverage unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a lightweight convolutional transformer framework, named ReL-SAR, exploiting the complementarity of convolutional and attention layers for jointly modeling spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to ensure more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieved very competitive results on limited-size datasets: MCAD, IXMAS, JHMDB, and NW-UCLA, showing the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code including all implementation parameters is provided at: this https URL
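The BYOL objective at the heart of the method is compact: a negative cosine similarity between the online network's prediction and a stop-gradient projection from a target network whose weights are an exponential moving average of the online ones. A sketch:

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """2 - 2 * cosine similarity; the target gets no gradient."""
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj.detach(), dim=1)
    return 2 - 2 * (p * z).sum(dim=1).mean()

@torch.no_grad()
def ema_update(target_net, online_net, m=0.996):
    """Slowly move target weights toward the online weights."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(m).add_(o, alpha=1 - m)
```

Here the two networks would consume differently augmented views of the same skeleton sequence.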
https://arxiv.org/abs/2409.05749
We introduce Segmentation by Factorization (F-SEG), an unsupervised segmentation method for pathology that generates segmentation masks from pre-trained deep learning models. F-SEG allows the use of pre-trained deep neural networks, including recently developed pathology foundation models, for semantic segmentation. It achieves this without requiring additional training or finetuning, by factorizing the spatial features extracted by the models into segmentation masks and their associated concept features. We create generic tissue phenotypes for H&E images by training clustering models for multiple numbers of clusters on features extracted from several deep learning models on The Cancer Genome Atlas Program (TCGA), and then show how the clusters can be used for factorizing corresponding segmentation masks using off-the-shelf deep learning models. Our results show that F-SEG provides robust unsupervised segmentation capabilities for H&E pathology images, and that the segmentation quality is greatly improved by utilizing pathology foundation models. We discuss and propose methods for evaluating the performance of unsupervised segmentation in pathology.
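The factorization principle is easy to illustrate with off-the-shelf non-negative matrix factorization: a spatial feature map decomposes into per-location concept weights (soft masks) and concept vectors. This is a generic sketch of the idea, not the paper's exact pipeline (which derives its concepts from TCGA-wide clustering):

```python
import numpy as np
from sklearn.decomposition import NMF

def factorize_features(feats, n_concepts=4):
    """Factorize an (H, W, D) feature map from a frozen model into
    n_concepts soft masks and their concept features."""
    H, W, D = feats.shape
    X = np.maximum(feats.reshape(H * W, D), 0)   # NMF needs non-negatives
    nmf = NMF(n_components=n_concepts, init="nndsvda", max_iter=500)
    masks = nmf.fit_transform(X)                 # (H*W, n_concepts)
    concepts = nmf.components_                   # (n_concepts, D)
    return masks.reshape(H, W, n_concepts), concepts
```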
https://arxiv.org/abs/2409.05697