Background: Voxel-based analysis (VBA) for population-level radiotherapy (RT) outcomes modeling requires topology-preserving inter-patient deformable image registration (DIR) that preserves tumors on moving images while avoiding unrealistic deformations due to tumors occurring on fixed images. Purpose: We developed a tumor-aware recurrent registration (TRACER) deep learning (DL) method and evaluated its suitability for VBA. Methods: TRACER consists of encoder layers implemented with stacked 3D convolutional long short-term memory networks (3D-CLSTM) followed by decoder and spatial transform layers to compute a dense deformation vector field (DVF). Multiple CLSTM steps are used to compute a progressive sequence of deformations. Input conditioning was applied by including tumor segmentations with the 3D image pairs as input channels. Bidirectional tumor rigidity, image similarity, and deformation smoothness losses were used to optimize the network in an unsupervised manner. TRACER and multiple DL methods were trained with 204 3D CT image pairs from patients with lung cancers (LC) and evaluated using (a) Dataset I (N = 308 pairs) with DL-segmented LCs, (b) Dataset II (N = 765 pairs) with manually delineated LCs, and (c) Dataset III with 42 LC patients treated with RT. Results: TRACER accurately aligned normal tissues. It best preserved tumors, indicated by the smallest tumor volume differences of 0.24%, 0.40%, and 0.13% and mean square errors in CT intensities of 0.005, 0.005, and 0.004, computed between original and resampled moving image tumors, for Datasets I, II, and III, respectively. It produced the smallest planned RT tumor dose difference between original and resampled moving images: 0.01 Gy and 0.013 Gy when using a female and a male reference, respectively.
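A minimal PyTorch sketch of the progressive-deformation idea described above. A plain convolutional step network stands in for the paper's 3D-CLSTM encoder, flows are composed additively, only one registration direction is shown, and all loss weights and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def identity_grid(shape, device):
    # Normalized [-1, 1] sampling grid for F.grid_sample, shape (1, D, H, W, 3).
    d, h, w = shape
    axes = [torch.linspace(-1, 1, n, device=device) for n in (d, h, w)]
    grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
    return grid.flip(-1).unsqueeze(0)  # grid_sample expects (x, y, z) order

def warp(img, dvf):
    # img: (B, C, D, H, W); dvf: (B, 3, D, H, W), (x, y, z) normalized offsets.
    grid = identity_grid(img.shape[2:], img.device) + dvf.permute(0, 2, 3, 4, 1)
    return F.grid_sample(img, grid, align_corners=True)

class StepNet(nn.Module):
    """One recurrent step: predicts an incremental DVF from the current
    warped moving image, the fixed image, and both tumor masks (the
    input conditioning mentioned in the abstract)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 3, 3, padding=1))
    def forward(self, moving, fixed, m_mask, f_mask):
        x = torch.cat([moving, fixed, m_mask, f_mask], dim=1)
        return 0.01 * torch.tanh(self.net(x))  # small incremental flow

def tracer_losses(moving, fixed, m_mask, f_mask, step_net, n_steps=3):
    dvf = torch.zeros(moving.size(0), 3, *moving.shape[2:], device=moving.device)
    warped = moving
    for _ in range(n_steps):  # progressive sequence of deformations
        dvf = dvf + step_net(warped, fixed, m_mask, f_mask)
        warped = warp(moving, dvf)
    sim = F.mse_loss(warped, fixed)                     # image similarity
    grads = torch.gradient(dvf, dim=(2, 3, 4))
    smooth = sum(g.pow(2).mean() for g in grads)        # deformation smoothness
    # Tumor rigidity: penalize spatially varying flow inside the tumor mask.
    rigid = sum((g * m_mask).pow(2).mean() for g in grads)
    return sim + smooth + 10.0 * rigid
```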
https://arxiv.org/abs/2409.11910
Recent large deep learning models in medicine show remarkable performance in medical image analysis and diagnosis, but their large number of parameters causes memory and inference-latency challenges. Knowledge distillation offers a solution, but slide-level gradients cannot be backpropagated for student model updates due to the high resolution of pathological images and the slide-level labels. This study presents an Efficient Fine-tuning on Compressed Models (EFCM) framework with two stages: unsupervised feature distillation and fine-tuning. In the distillation stage, Feature Projection Distillation (FPD) is proposed with a TransScan module for adaptive receptive-field adjustment to enhance the knowledge-absorption capability of the student model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM, Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are conducted on 11 downstream datasets related to three large medical models: RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The experimental results demonstrate that the EFCM framework significantly improves accuracy and efficiency in handling slide-level pathological image problems, effectively addressing the challenges of deploying large medical models. Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The analysis of model inference efficiency highlights the high efficiency of the distillation fine-tuning method.
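A hedged sketch of the unsupervised feature-distillation stage: the student's features are projected into the frozen teacher's embedding space and regressed onto the teacher features, so no slide-level labels are needed. The plain linear projector stands in for the paper's TransScan module, and all dimensions and names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjectionDistiller(nn.Module):
    def __init__(self, d_student=384, d_teacher=1024):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)  # placeholder for TransScan

    def forward(self, student_feats, teacher_feats):
        # student_feats: (B, N, d_student); teacher_feats: (B, N, d_teacher),
        # both computed on the same image patches; the teacher is frozen.
        return F.mse_loss(self.proj(student_feats), teacher_feats.detach())

# Usage: loss = distiller(student(x), teacher(x)); loss.backward() then
# updates only the student and the projector.
```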
https://arxiv.org/abs/2409.11817
Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, and so on. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from vision foundation models (VFMs). Specifically, we introduce a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bitcost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
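One plausible reading of the shared alignment layer with per-VFM prompts, sketched below: a single projection is shared across all frozen VFMs, while a learnable prompt vector per VFM specializes it; the compressed-video features are pulled toward each VFM's features by a cosine loss. Dimensions, the prompt mechanism, and the loss form are all illustrative assumptions, not the authors' design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSemanticAlignment(nn.Module):
    def __init__(self, d_video=256, d_vfm=768, n_vfms=3, n_prompt=4):
        super().__init__()
        self.shared = nn.Linear(d_video + n_prompt * d_video, d_vfm)
        self.prompts = nn.Parameter(torch.randn(n_vfms, n_prompt * d_video) * 0.02)

    def forward(self, video_feat, vfm_idx):
        # video_feat: (B, d_video) pooled features of the compressed video.
        p = self.prompts[vfm_idx].expand(video_feat.size(0), -1)
        return self.shared(torch.cat([video_feat, p], dim=1))

def alignment_loss(align, video_feat, vfm_feats):
    # vfm_feats: list of (B, d_vfm) frozen features, one per VFM; different
    # VFMs jointly shape the semantic space of the compression model.
    return sum(1 - F.cosine_similarity(align(video_feat, i), f.detach(), dim=1).mean()
               for i, f in enumerate(vfm_feats))
```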
https://arxiv.org/abs/2409.11718
The Forward-Forward (FF) algorithm is a recent, purely forward-mode learning method, that updates weights locally and layer-wise and supports supervised as well as unsupervised learning. These features make it ideal for applications such as brain-inspired learning, low-power hardware neural networks, and distributed learning in large models. However, while FF has shown promise on written digit recognition tasks, its performance on natural images and time-series remains a challenge. A key limitation is the need to generate high-quality negative examples for contrastive learning, especially in unsupervised tasks, where versatile solutions are currently lacking. To address this, we introduce the Self-Contrastive Forward-Forward (SCFF) method, inspired by self-supervised contrastive learning. SCFF generates positive and negative examples applicable across different datasets, surpassing existing local forward algorithms for unsupervised classification accuracy on MNIST (MLP: 98.7%), CIFAR-10 (CNN: 80.75%), and STL-10 (CNN: 77.3%). Additionally, SCFF is the first to enable FF training of recurrent neural networks, opening the door to more complex tasks and continuous-time video and text processing.
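A sketch of an FF layer with a self-contrastive pairing in the spirit of SCFF. The pairing below (each sample concatenated with itself as the positive, and with a different sample as the negative) is one plausible reading of the self-contrastive construction, not necessarily the paper's exact recipe; the goodness loss follows Hinton's original FF formulation, and the threshold and sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, theta=2.0, lr=1e-3):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.theta = theta
        self.opt = torch.optim.Adam(self.fc.parameters(), lr=lr)

    def goodness(self, x):
        # Goodness = sum of squared activities of the layer.
        return self.fc(F.normalize(x, dim=1)).relu().pow(2).sum(dim=1)

    def local_step(self, x_pos, x_neg):
        # Push goodness above theta for positives, below for negatives;
        # the update is purely local, with no global backprop.
        loss = (F.softplus(self.theta - self.goodness(x_pos)) +
                F.softplus(self.goodness(x_neg) - self.theta)).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        with torch.no_grad():  # pass detached activations to the next layer
            return (self.fc(F.normalize(x_pos, dim=1)).relu(),
                    self.fc(F.normalize(x_neg, dim=1)).relu())

x = torch.randn(32, 784)                             # e.g. flattened digits
pos = torch.cat([x, x], dim=1)                       # sample paired with itself
neg = torch.cat([x, x[torch.randperm(32)]], dim=1)   # paired with another sample
layer = FFLayer(2 * 784, 512)
pos, neg = layer.local_step(pos, neg)                # feed these to the next layer
```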
https://arxiv.org/abs/2409.11593
Out-of-distribution (OOD) detection is crucial for enhancing the generalization of AI models used in mammogram screening. Given the challenge of limited prior knowledge about OOD samples in external datasets, unsupervised generative learning is a preferable solution that trains the model to discern the normal characteristics of in-distribution (ID) data. The hypothesis is that during inference, the model aims to reconstruct ID samples accurately, while OOD samples exhibit poorer reconstruction due to their divergence from normality. Inspired by state-of-the-art (SOTA) hybrid architectures combining CNNs and transformers, we developed a novel backbone, HAND, for detecting OOD samples in large-scale digital screening mammogram studies. To boost learning efficiency, we incorporated synthetic OOD samples and a parallel discriminator in the latent space to distinguish between ID and OOD samples. Gradient reversal applied to the OOD reconstruction loss penalizes the model for learning OOD reconstructions. An anomaly score is computed by weighting the reconstruction and discriminator losses. On an internal RSNA mammogram held-out test set and an external Mayo Clinic hand-curated dataset, the proposed HAND model outperformed encoder-based and GAN-based baselines and, interestingly, also outperformed hybrid CNN+transformer baselines. The proposed HAND pipeline thus offers an automated, efficient computational solution for domain-specific quality checks in external screening mammograms, yielding actionable insights without direct exposure to the private medical imaging data.
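A minimal sketch of the two mechanics named above: a standard gradient-reversal layer to flip the gradients flowing from the OOD reconstruction loss, and a weighted anomaly score combining reconstruction error and discriminator output. The weights and the exact score combination are assumptions:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)          # identity in the forward pass
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None   # reversed gradient

# Hypothetical usage inside the training step:
# recon_ood = decoder(GradReverse.apply(encoder(x_ood)))

def anomaly_score(x, recon, disc_prob_id, w_recon=0.7, w_disc=0.3):
    # Weighted combination of per-sample reconstruction error and the
    # discriminator's confidence that the latent came from ID data.
    rec = (x - recon).pow(2).flatten(1).mean(dim=1)
    return w_recon * rec + w_disc * (1.0 - disc_prob_id)
```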
https://arxiv.org/abs/2409.11534
Recent advancements in deep learning have shown impressive results in image and video denoising, leveraging extensive pairs of noisy and noise-free data for supervision. However, the challenge of acquiring paired videos for dynamic scenes hampers the practical deployment of deep video denoising techniques. In contrast, this obstacle is less pronounced in image denoising, where paired data is more readily available. Thus, a well-trained image denoiser could serve as a reliable spatial prior for video denoising. In this paper, we propose a novel unsupervised video denoising framework, named "Temporal As a Plugin" (TAP), which integrates tunable temporal modules into a pre-trained image denoiser. By incorporating temporal modules, our method can harness temporal information across noisy frames, complementing its power of spatial denoising. Furthermore, we introduce a progressive fine-tuning strategy that refines each temporal module using the generated pseudo clean video frames, progressively enhancing the network's denoising performance. Compared to other unsupervised video denoising methods, our framework demonstrates superior performance on both sRGB and raw video denoising datasets.
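A sketch of the "temporal as a plugin" idea: a tunable temporal module wrapped around a frozen block of the pre-trained image denoiser, initialized to the identity so the spatial prior is preserved when fine-tuning starts. The residual 1D-temporal convolution is an illustrative module design, not the paper's; the frozen block is assumed to preserve channel count and spatial size:

```python
import torch
import torch.nn as nn

class TemporalPlugin(nn.Module):
    def __init__(self, frozen_block, channels):
        super().__init__()
        self.block = frozen_block.requires_grad_(False)  # spatial prior, fixed
        self.temporal = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
        nn.init.zeros_(self.temporal.weight)   # residual identity at init
        nn.init.zeros_(self.temporal.bias)

    def forward(self, frames):
        # frames: (B, T, C, H, W) noisy neighbors around the target frame.
        b, t, c, h, w = frames.shape
        x = self.block(frames.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Mix information across the T axis; only this module is tuned,
        # e.g. against pseudo-clean frames during progressive fine-tuning.
        x = x + self.temporal(x.permute(0, 2, 1, 3, 4)).permute(0, 2, 1, 3, 4)
        return x
```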
https://arxiv.org/abs/2409.11256
Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinger Bridge (SB) offers a more stable solution by utilizing Optimal Transport (OT) theory to model a stochastic differential equation (SDE) between two arbitrary distributions. This allows SB to effectively transform low-quality retinal images into their high-quality counterparts. In this work, we leverage the SB framework to propose an image-to-image translation pipeline for retinal image enhancement. Additionally, previous methods often fail to capture fine structural details, such as blood vessels. To address this, we enhance our pipeline by introducing Dynamic Snake Convolution, whose tortuous receptive field can better preserve tubular structures. We name the resulting retinal fundus image enhancement framework the Context-aware Unpaired Neural Schrödinger Bridge (CUNSB-RFIE). To the best of our knowledge, this is the first endeavor to use the SB approach for retinal image enhancement. Experimental results on a large-scale dataset demonstrate the advantage of the proposed method compared to several state-of-the-art supervised and unsupervised methods in terms of image quality and performance on downstream tasks. The code is available at \url{this https URL}.
https://arxiv.org/abs/2409.10966
This paper proposes Attention-Seeker, an unsupervised keyphrase extraction method that leverages self-attention maps from a Large Language Model to estimate the importance of candidate phrases. Our approach identifies specific components - such as layers, heads, and attention vectors - where the model pays significant attention to the key topics of the text. The attention weights provided by these components are then used to score the candidate phrases. Unlike previous models that require manual tuning of parameters (e.g., selection of heads, prompts, hyperparameters), Attention-Seeker dynamically adapts to the input text without any manual adjustments, enhancing its practical applicability. We evaluate Attention-Seeker on four publicly available datasets: Inspec, SemEval2010, SemEval2017, and Krapivin. Our results demonstrate that, even without parameter tuning, Attention-Seeker outperforms most baseline models, achieving state-of-the-art performance on three out of four datasets, particularly excelling in extracting keyphrases from long documents.
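A simplified sketch of attention-based phrase scoring in the spirit of Attention-Seeker: rank candidates by the self-attention their tokens receive. The paper additionally selects informative layers, heads, and attention vectors automatically; here we naively average over all of them, and the BERT model choice is an illustrative assumption:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

def score_candidates(text, candidates):
    enc = tok(text, return_tensors="pt", truncation=True)
    ids = enc["input_ids"][0].tolist()
    with torch.no_grad():
        out = model(**enc)
    # attentions: tuple of (1, heads, seq, seq) per layer; average over
    # layers, heads, and queries -> per-token "attention received".
    att = torch.stack(out.attentions).mean(dim=(0, 2, 3)).squeeze(0)
    scores = {}
    for cand in candidates:
        cand_ids = tok(cand, add_special_tokens=False)["input_ids"]
        best = 0.0
        for i in range(len(ids) - len(cand_ids) + 1):
            if ids[i:i + len(cand_ids)] == cand_ids:   # locate the phrase
                best = max(best, att[i:i + len(cand_ids)].mean().item())
        scores[cand] = best
    return sorted(scores, key=scores.get, reverse=True)

print(score_candidates(
    "Keyphrase extraction identifies important phrases in documents.",
    ["keyphrase extraction", "documents", "important phrases"]))
```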
https://arxiv.org/abs/2409.10907
Multi-frequency Electrical Impedance Tomography (mfEIT) is a promising biomedical imaging technique that estimates tissue conductivities across different frequencies. Current state-of-the-art (SOTA) algorithms, which rely on supervised learning and Multiple Measurement Vectors (MMV), require extensive training data, making them time-consuming, costly, and less practical for widespread applications. Moreover, the dependency on training data in supervised MMV methods can introduce erroneous conductivity contrasts across frequencies, posing significant concerns in biomedical applications. To address these challenges, we propose a novel unsupervised learning approach based on Multi-Branch Attention Image Prior (MAIP) for mfEIT reconstruction. Our method employs a carefully designed Multi-Branch Attention Network (MBA-Net) to represent multiple frequency-dependent conductivity images and simultaneously reconstructs mfEIT images by iteratively updating its parameters. By leveraging the implicit regularization capability of the MBA-Net, our algorithm can capture significant inter- and intra-frequency correlations, enabling robust mfEIT reconstruction without the need for training data. Through simulation and real-world experiments, our approach demonstrates performance comparable to, or better than, SOTA algorithms while exhibiting superior generalization capability. These results suggest that the MAIP-based method can be used to improve the reliability and applicability of mfEIT in various settings.
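A generic untrained-network ("image prior") reconstruction loop in the spirit of MAIP: the network weights are the only unknowns and are fit directly to the measured voltages through the forward operator, so no training data is involved. The tiny MLP, the random linearized operator `A`, and all shapes are placeholders for the paper's MBA-Net and the actual EIT physics:

```python
import torch
import torch.nn as nn

n_pix, n_meas, n_freq = 1024, 208, 3
A = torch.randn(n_meas, n_pix) / n_pix ** 0.5    # stand-in forward operator
v_meas = torch.randn(n_freq, n_meas)             # measured voltages per frequency

net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, n_freq * n_pix))
z = torch.randn(1, 64)                           # fixed random input code
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for it in range(2000):
    # One network jointly represents all frequency-dependent images, so
    # its implicit regularization couples the frequencies.
    sigma = net(z).view(n_freq, n_pix)
    loss = ((sigma @ A.T) - v_meas).pow(2).mean()  # data-fit, no labels
    opt.zero_grad(); loss.backward(); opt.step()
```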
https://arxiv.org/abs/2409.10794
Iterative self-training, or iterative pseudo-labeling (IPL)--using an improved model from the current iteration to provide pseudo-labels for the next iteration--has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameters tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, which includes the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
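A compact sketch of the IPL loop described above: cluster the current embeddings, treat cluster ids as pseudo speaker labels, retrain the encoder on them, and repeat. The `train_encoder` routine and the i-vector extractor supplying the round-0 embeddings are assumed to exist elsewhere; cluster count and round count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def ipl(initial_embeddings, utterances, train_encoder, n_clusters=1000, rounds=5):
    emb = initial_embeddings                  # e.g. i-vectors, shape (N, D)
    encoder = None
    for r in range(rounds):
        # Pseudo-labels from clustering the current speaker embeddings.
        pseudo = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(emb)
        encoder = train_encoder(utterances, pseudo)   # supervised step
        emb = encoder(utterances)             # fresh embeddings for next round
    return encoder
```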
https://arxiv.org/abs/2409.10791
This study explores using embedding rank as an unsupervised evaluation metric for general-purpose speech encoders trained via self-supervised learning (SSL). Traditionally, assessing the performance of these encoders is resource-intensive and requires labeled data from the downstream tasks. Inspired by the vision domain, where embedding rank has shown promise for evaluating image encoders without tuning on labeled downstream data, this work examines its applicability in the speech domain, considering the temporal nature of the signals. The findings indicate rank correlates with downstream performance within encoder layers across various downstream tasks and for in- and out-of-domain scenarios. However, rank does not reliably predict the best-performing layer for specific downstream tasks, as lower-ranked layers can outperform higher-ranked ones. Despite this limitation, the results suggest that embedding rank can be a valuable tool for monitoring training progress in SSL speech models, offering a less resource-demanding alternative to traditional evaluation methods.
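A common effective-rank estimate (RankMe-style) that could serve as the unsupervised metric discussed above: the exponential of the entropy of the normalized singular-value spectrum of an embedding matrix. Whether this exact estimator matches the paper's is an assumption:

```python
import numpy as np

def effective_rank(embeddings, eps=1e-12):
    # embeddings: (n_samples, dim), e.g. pooled features from one layer.
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / (s.sum() + eps)                 # normalized singular values
    return float(np.exp(-np.sum(p * np.log(p + eps))))

rng = np.random.default_rng(0)
print(effective_rank(rng.normal(size=(512, 256))))   # high: close to dim
print(effective_rank(rng.normal(size=(512, 4)) @ rng.normal(size=(4, 256))))  # ~4
```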
https://arxiv.org/abs/2409.10787
Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named motion-conjugated feature representations. Unlike existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstraction, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity. We assess our model on photorealistic synthetic streams and real-world videos, comparing to pre-trained state-of-the-art feature extractors (also based on Transformers) and to recent unsupervised learning models, significantly outperforming these alternatives.
https://arxiv.org/abs/2409.11441
Semi-supervised medical image segmentation has shown promise in training models with limited labeled data and abundant unlabeled data. However, state-of-the-art methods ignore a potentially valuable source of unsupervised semantic information -- spatial registration transforms between image volumes. To address this, we propose CCT-R, a contrastive cross-teaching framework incorporating registration information. To leverage the semantic information available in registrations between volume pairs, CCT-R incorporates two proposed modules: Registration Supervision Loss (RSL) and Registration-Enhanced Positive Sampling (REPS). The RSL leverages segmentation knowledge derived from transforms between labeled and unlabeled volume pairs, providing an additional source of pseudo-labels. REPS enhances contrastive learning by identifying anatomically-corresponding positives across volumes using registration transforms. Experimental results on two challenging medical segmentation benchmarks demonstrate the effectiveness and superiority of CCT-R across various semi-supervised settings, with as few as one labeled case. Our code is available at this https URL.
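A sketch of the Registration Supervision Loss idea: warp the labeled volume's segmentation through a precomputed registration transform to obtain a pseudo-label for the paired unlabeled volume, then supervise the student's prediction with it. The displacement-field convention and the soft-Dice surrogate below are assumptions, not the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def warp_labels(labels_onehot, dvf):
    # labels_onehot: (B, C, D, H, W); dvf: (B, D, H, W, 3) normalized offsets.
    d, h, w = labels_onehot.shape[2:]
    axes = [torch.linspace(-1, 1, n, device=dvf.device) for n in (d, h, w)]
    base = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).flip(-1)
    return F.grid_sample(labels_onehot, base.unsqueeze(0) + dvf,
                         mode="nearest", align_corners=True)

def registration_supervision_loss(student_logits, labels_onehot, dvf):
    pseudo = warp_labels(labels_onehot, dvf)      # registration pseudo-label
    probs = student_logits.softmax(dim=1)
    inter = (probs * pseudo).sum(dim=(2, 3, 4))
    union = probs.sum(dim=(2, 3, 4)) + pseudo.sum(dim=(2, 3, 4))
    return (1 - (2 * inter + 1e-5) / (union + 1e-5)).mean()   # soft Dice
```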
https://arxiv.org/abs/2409.10422
Recent advancements in autonomous driving have seen a paradigm shift towards end-to-end learning paradigms, which map sensory inputs directly to driving actions, thereby enhancing the robustness and adaptability of autonomous vehicles. However, these models often sacrifice interpretability, posing significant challenges to trust, safety, and regulatory compliance. To address these issues, we introduce DRIVE -- Dependable Robust Interpretable Visionary Ensemble Framework in Autonomous Driving, a comprehensive framework designed to improve the dependability and stability of explanations in end-to-end unsupervised autonomous driving models. Our work specifically targets the inherent instability problems observed in the Driving through the Concept Gridlock (DCG) model, which undermine the trustworthiness of its explanations and decision-making processes. We define four key attributes of DRIVE: consistent interpretability, stable interpretability, consistent output, and stable output. These attributes collectively ensure that explanations remain reliable and robust across different scenarios and perturbations. Through extensive empirical evaluations, we demonstrate the effectiveness of our framework in enhancing the stability and dependability of explanations, thereby addressing the limitations of current models. Our contributions include an in-depth analysis of the dependability issues within the DCG model, a rigorous definition of DRIVE with its fundamental properties, a framework to implement DRIVE, and novel metrics for evaluating the dependability of concept-based explainable autonomous driving models. These advancements lay the groundwork for the development of more reliable and trusted autonomous driving systems, paving the way for their broader acceptance and deployment in real-world applications.
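One way the "stable interpretability" attribute could be instantiated: measure how much the model's explanation moves under small input perturbations. The cosine-similarity metric below is an illustrative assumption, not the paper's exact definition:

```python
import torch
import torch.nn.functional as F

def explanation_stability(explain_fn, x, sigma=0.01, trials=8):
    # explain_fn: maps an input batch (B, ...) to explanation vectors (B, D).
    base = explain_fn(x)
    sims = [F.cosine_similarity(
                base, explain_fn(x + sigma * torch.randn_like(x)), dim=-1).mean()
            for _ in range(trials)]
    return torch.stack(sims).mean()   # close to 1.0 = stable explanations
```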
https://arxiv.org/abs/2409.10330
Unsupervised anomaly detection is a daunting task, as it relies solely on normality patterns from the training data to identify unseen anomalies during testing. Recent approaches have focused on leveraging domain-specific transformations or perturbations to generate synthetic anomalies from normal samples. The objective here is to acquire insights into normality patterns by learning to differentiate between normal samples and these crafted anomalies. However, these approaches often encounter limitations when domain-specific transformations are not well-specified, such as in tabular data, or when it becomes trivial to distinguish between them. To address these issues, we introduce a novel domain-agnostic method that employs a set of conditional perturbators and a discriminator. The perturbators are trained to generate input-dependent perturbations, which are subsequently utilized to construct synthetic anomalies, and the discriminator is trained to distinguish normal samples from them. We ensure that the generated anomalies are both diverse and hard to distinguish through two key strategies: i) directing perturbations to be orthogonal to each other and ii) constraining perturbations to remain in proximity to normal samples. Through experiments on real-world datasets, we demonstrate the superiority of our method over state-of-the-art benchmarks, which is evident not only in image data but also in tabular data, where domain-specific transformation is not readily accessible. Additionally, we empirically confirm the adaptability of our method to semi-supervised settings, demonstrating its capacity to incorporate supervised signals to enhance anomaly detection performance even further.
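A sketch of the two constraints named above: pairwise orthogonality between the learned perturbations and proximity of the synthetic anomalies to the normal sample. The perturbator architecture and the loss weights are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Perturbators(nn.Module):
    def __init__(self, dim, k=4):
        super().__init__()
        self.nets = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(k)])

    def forward(self, x):
        # Input-dependent perturbations, one per perturbator: (k, B, dim).
        return torch.stack([net(x) for net in self.nets])

def perturbation_regularizers(perts):
    k = perts.size(0)
    flat = F.normalize(perts.flatten(1), dim=1)      # (k, B*dim)
    gram = flat @ flat.T                              # pairwise cosines
    ortho = (gram - torch.eye(k)).pow(2).sum()        # i) orthogonality
    proximity = perts.pow(2).mean()                   # ii) stay near x
    return ortho + 0.1 * proximity

x = torch.randn(32, 128)                              # normal samples
perts = Perturbators(128)(x)
anomalies = x.unsqueeze(0) + perts                    # synthetic anomalies
reg = perturbation_regularizers(perts)                # add to discriminator loss
```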
https://arxiv.org/abs/2409.10069
In this paper we present a new machine learning workflow with unsupervised learning techniques to identify domains within atomic force microscopy (AFM) images obtained from polymer films. The goal of the workflow is to identify the spatial locations of the two types of polymer domains with little to no manual intervention and to calculate the domain size distributions, which in turn can help classify the phase-separated state of the material as macrophase- or microphase-ordered or disordered domains. We briefly review existing approaches from other fields, computer vision and signal processing, that are applicable to these tasks, which arise frequently in polymer science and engineering. We then test these computer vision and signal processing approaches on the AFM image dataset to identify the strengths and limitations of each for our first task. For this first task, domain segmentation, we found that the workflow using the discrete Fourier transform (DFT) or discrete cosine transform (DCT) with variance statistics as the feature works best. The popular ResNet50 deep learning approach from the computer vision field exhibited relatively poorer performance on the domain segmentation task for our AFM images compared to the DFT- and DCT-based workflows. For the second task, for each of the 144 input AFM images, we used the existing porespy Python package to calculate the domain size distribution from the output of the DFT-based workflow for that image. The information and open-source codes we share in this paper can serve as a guide for researchers in the polymer and soft materials fields who need ML modeling and workflows for automated analyses of AFM images from polymer samples that may have crystalline or amorphous domains, sharp or rough interfaces between domains, or microphase- or macrophase-separated domains.
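A sketch of the best-performing workflow as described: a local DFT per image patch, the variance of its magnitude spectrum as the per-patch feature, and two-way clustering into domains. Patch size and normalization are assumptions; the porespy-based size-distribution step is omitted here:

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_domains(afm_image, patch=16):
    h, w = (np.array(afm_image.shape) // patch) * patch
    feats, coords = [], []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tile = afm_image[i:i + patch, j:j + patch]
            # Variance of the local DFT magnitude spectrum as the feature.
            feats.append(np.var(np.abs(np.fft.fft2(tile))))
            coords.append((i, j))
    labels = KMeans(n_clusters=2, n_init="auto").fit_predict(
        np.array(feats).reshape(-1, 1))
    seg = np.zeros((h, w), dtype=int)
    for (i, j), lab in zip(coords, labels):
        seg[i:i + patch, j:j + patch] = lab
    return seg   # 0/1 map of the two polymer domains
```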
https://arxiv.org/abs/2409.11438
Abnormal event detection, or anomaly detection, in surveillance videos is currently a challenge because of the diversity of possible events. Due to the lack of anomalous events at training time, anomaly detection requires the design of learning methods without supervision. In this work we propose an unsupervised approach for video anomaly detection that jointly optimizes the objectives of the deep neural network and the anomaly detection task using a hybrid architecture. Initially, a convolutional autoencoder is pre-trained in an unsupervised manner on a fusion of depth, motion, and appearance features. In the second step, we utilize the encoder part of the pre-trained autoencoder and extract the embeddings of the fused input. We then jointly train/fine-tune the encoder to map the embeddings to a hypercenter. Thus, embeddings of normal data fall near the hypercenter, whereas embeddings of anomalous data fall far away from it.
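A sketch of the hypercenter objective described above, in the spirit of Deep SVDD: fix a center from the initial embeddings of normal data, fine-tune the pre-trained encoder to pull normal embeddings toward it, and score test samples by their distance to the center. The training schedule and the use of the mean embedding as the center are assumptions:

```python
import torch

def fit_hypercenter(encoder, normal_batches, opt, epochs=10):
    with torch.no_grad():   # center = mean initial embedding of normal data
        c = torch.cat([encoder(b) for b in normal_batches]).mean(dim=0)
    for _ in range(epochs):
        for b in normal_batches:
            # Pull embeddings of normal data toward the fixed center.
            loss = (encoder(b) - c).pow(2).sum(dim=1).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return c

def anomaly_score(encoder, x, c):
    with torch.no_grad():
        return (encoder(x) - c).pow(2).sum(dim=1)   # far from center = anomalous
```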
https://arxiv.org/abs/2409.09804
Test Time Adaptation (TTA) has emerged as a practical solution to mitigate the performance degradation of Deep Neural Networks (DNNs) in the presence of corruption/noise affecting inputs. Existing approaches in TTA continuously adapt the DNN, leading to excessive resource consumption and performance degradation due to the accumulation of error stemming from a lack of supervision. In this work, we propose Domain-Aware Real-Time Dynamic Adaptation (DARDA) to address such issues. Our key approach is to proactively learn latent representations of some corruption types, each associated with a sub-network state tailored to correctly classify inputs affected by that corruption. After deployment, DARDA adapts the DNN to previously unseen corruptions in an unsupervised fashion by (i) estimating the latent representation of the ongoing corruption; (ii) selecting the sub-network whose associated corruption is closest in the latent space to the ongoing corruption; and (iii) adapting the DNN state so that its representation matches the ongoing corruption. This way, DARDA is more resource efficient and can swiftly adapt to new distributions caused by different corruptions without requiring a large variety of input data. Through experiments with two popular mobile edge devices - Raspberry Pi and NVIDIA Jetson Nano - we show that DARDA reduces energy consumption and average cache memory footprint by 1.74x and 2.64x, respectively, with respect to the state of the art, while increasing performance by 10.4%, 5.7%, and 4.4% on CIFAR-10, CIFAR-100, and TinyImagenet.
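A sketch of the selection step (ii)-(iii) described above: embed a small batch of incoming corrupted inputs, find the stored corruption prototype nearest in the latent space, and load the sub-network state associated with it. The corruption encoder and the state registry are assumed components, and cosine similarity is an assumed distance:

```python
import torch
import torch.nn.functional as F

def select_subnetwork(dnn, corruption_encoder, batch, prototypes, states):
    # prototypes: (K, D) latent representations of known corruptions;
    # states: list of K state_dicts, one sub-network state per corruption.
    with torch.no_grad():
        z = corruption_encoder(batch).mean(dim=0, keepdim=True)   # (1, D)
        sims = F.cosine_similarity(z, prototypes)                 # (K,)
        k = int(sims.argmax())
    dnn.load_state_dict(states[k])   # adapt the DNN state to the corruption
    return k, z
```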
https://arxiv.org/abs/2409.09753
Contemporary deep learning architectures lack principled means for capturing and handling fundamental visual concepts, like objects, shapes, geometric transforms, and other higher-level structures. We propose a neurosymbolic architecture that uses a domain-specific language to capture selected priors of image formation, including object shape, appearance, categorization, and geometric transforms. We express template programs in that language and learn their parameterization with features extracted from the scene by a convolutional neural network. When executed, the parameterized program produces geometric primitives, which are rendered, assessed for correspondence with the scene content, and trained via auto-association with gradients. We compare our approach against a baseline method on a synthetic benchmark and demonstrate its capacity to disentangle selected aspects of the image formation process, learn from small data, correct inference in the presence of noise, and generalize out of sample.
https://arxiv.org/abs/2409.09716
In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting Population, Intervention, Comparator, and Outcome (PICO) elements from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches to PICO frame extraction involve supervised methods that rely on manually annotated data points in the form of BIO label tagging. Recent approaches, such as In-Context Learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, require the use of labeled examples. In this work, we adopt an ICL strategy that employs the pretrained knowledge of Large Language Models (LLMs), gathered during the pretraining phase, to automatically extract PICO-related terminology from clinical trial documents in an unsupervised setup, bypassing the need for a large number of annotated data instances. Additionally, to showcase the highest effectiveness of the LLM in an oracle scenario where a large number of annotated samples is available, we adopt an instruction-tuning strategy, employing Low-Rank Adaptation (LoRA) to train the gigantic model in a low-resource environment for the PICO frame extraction task. Our empirical results show that our proposed ICL-based framework produces comparable results on all versions of the EBM-NLP dataset, and the instruction-tuned version of our framework produces state-of-the-art results on all the different EBM-NLP datasets. Our project is available at \url{this https URL}.
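A minimal zero-shot prompt of the kind the unsupervised ICL setup implies: the LLM's pretrained knowledge is asked to extract PICO spans without any BIO annotations. The prompt wording is an illustrative assumption, and `call_llm` is a hypothetical stand-in for whatever completion API is used:

```python
def build_pico_prompt(abstract: str) -> str:
    return (
        "Extract the PICO elements from the clinical trial abstract below.\n"
        "Return one line each for Population, Intervention, Comparator, and "
        "Outcome, quoting the exact phrases from the text.\n\n"
        f"Abstract: {abstract}\n"
    )

# response = call_llm(build_pico_prompt(abstract))  # hypothetical API call
```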
https://arxiv.org/abs/2409.09704