Purpose: Surgical video is an important data stream for gesture recognition; robust visual encoders for these data streams are therefore similarly important. Methods: Leveraging the Bridge-Prompt framework, we fine-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can utilize extensive outside video data as well as text, and also makes use of label metadata and weakly supervised contrastive losses. Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks not provided during encoder training are included at prediction time. Additionally, we measure the benefit of including text descriptions in the feature-extractor training scheme. Conclusion: Bridge-Prompt and similar pre-trained and fine-tuned video encoder models provide strong visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to transfer zero-shot, without any task- (gesture-) specific retraining, makes them invaluable.
https://arxiv.org/abs/2403.19786
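The weakly supervised contrastive alignment of video and text features mentioned above can be sketched as a symmetric CLIP-style InfoNCE loss. This is a minimal illustration, not the Bridge-Prompt implementation; the function name, temperature, and shapes are assumptions:

```python
import numpy as np

def clip_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss pairing each video clip with its text prompt.

    video_feats, text_feats: (N, d) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(v))              # matching pairs lie on the diagonal

    def xent(lg):
        # cross-entropy with the diagonal as the target class per row
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of video->text and text->video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the two feature sets are perfectly aligned the loss is near zero; misaligned features are penalized in both retrieval directions.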
In recent years, point weakly supervised object detection (PWSOD) methods in computer vision have attracted considerable attention. However, existing pseudo-label generation methods perform poorly when only a small amount of supervised annotation data is available and in dense object detection tasks. We regard weakly supervised pseudo-label generation as the result of the model's sparse output, and propose a method called Sparse Generation to make pseudo labels sparse. It constructs dense tensors through the relationship between the data and the detector model, optimizes three of its parameters, and obtains a sparse tensor via coordinated calculation, thereby indirectly obtaining higher-quality pseudo labels and addressing the model's density problem when only a small amount of supervised annotation data can be used. On two widely used open-source datasets (RSOD, SIMD) and a self-built dataset (Bullet-Hole), experimental results show that the proposed method has a significant advantage in overall performance metrics compared to the state-of-the-art method.
https://arxiv.org/abs/2403.19306
We propose a voting-driven semi-supervised approach to automatically acquire the typical duration of an event and use it as pseudo-labeled data. Human evaluation demonstrates that our pseudo labels exhibit surprisingly high accuracy and balanced coverage. On the temporal commonsense QA task, experimental results show that using pseudo examples of only 400 events, we achieve performance comparable to existing BERT-based weakly supervised approaches that require a significant number of training examples. Compared to the RoBERTa baselines, our best approach establishes state-of-the-art performance with a 7% improvement in Exact Match.
https://arxiv.org/abs/2403.18504
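The voting idea above can be sketched as: gather candidate (value, unit) durations mined for one event, vote on the dominant unit, then take a median value within it. This is an illustrative toy, not the paper's actual voting scheme; the function name and tie-breaking are assumptions:

```python
from collections import Counter

def vote_typical_duration(candidates):
    """Pick the most-voted duration unit, then the median value in that unit.

    `candidates` is a list of (value, unit) pairs mined for one event,
    e.g. [(30, "minute"), (1, "hour"), (45, "minute")].
    """
    units = Counter(unit for _, unit in candidates)
    top_unit, _ = units.most_common(1)[0]
    # median (upper median for even counts) of the values in the winning unit
    values = sorted(v for v, u in candidates if u == top_unit)
    median = values[len(values) // 2]
    return median, top_unit
```

The voted result, e.g. `(45, "minute")` for the example list, would then serve as the pseudo label for that event.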
In digital pathology, the multiple instance learning (MIL) strategy is widely used in the weakly supervised histopathology whole slide image (WSI) classification task, where giga-pixel WSIs are labeled only at the slide level. However, existing attention-based MIL approaches often overlook contextual information and intrinsic spatial relationships between neighboring tissue tiles, while graph-based MIL frameworks have limited power to recognize long-range dependencies. In this paper, we introduce an integrative graph-transformer framework that simultaneously captures context-aware relational features and global WSI representations through a novel Graph Transformer Integration (GTI) block. Specifically, each GTI block consists of a Graph Convolutional Network (GCN) layer modeling neighboring relations at the local instance level and an efficient global attention model capturing comprehensive global information from extensive feature embeddings. Extensive experiments on three publicly available WSI datasets, TCGA-NSCLC, TCGA-RCC, and BRIGHT, demonstrate the superiority of our approach over current state-of-the-art MIL methods, achieving an improvement of 1.0% to 2.6% in accuracy and 0.7% to 1.6% in AUROC.
https://arxiv.org/abs/2403.18134
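The local half of the GTI block described above is a standard graph-convolution step over neighboring tiles. A minimal NumPy sketch (not the paper's code; the symmetric normalization and ReLU follow the common GCN recipe, and all names are illustrative):

```python
import numpy as np

def gcn_layer(features, adj, weight):
    """One graph-convolution step: aggregate neighbouring tile embeddings.

    features: (n_tiles, d_in) tile feature matrix
    adj:      (n_tiles, n_tiles) binary adjacency of neighbouring tiles
    weight:   (d_in, d_out) learnable projection
    """
    a_hat = adj + np.eye(len(adj))                 # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(deg ** -0.5)
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt     # symmetric normalization
    return np.maximum(norm_adj @ features @ weight, 0.0)  # ReLU activation
```

With an empty adjacency matrix only the self-loop remains and the layer reduces to a plain ReLU projection of each tile, which makes the neighbor-aggregation role of `adj` easy to see.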
In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed and textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and a qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation and animation pipelines without any post-processing.
https://arxiv.org/abs/2403.17541
Long-tailed data is prevalent in real-world classification tasks, and learning from it relies heavily on supervised information, which makes the annotation process exceptionally labor-intensive and time-consuming. Unfortunately, despite being a common approach to mitigate labeling costs, existing weakly supervised learning methods struggle to adequately preserve supervised information for tail samples, resulting in a decline in accuracy for the tail classes. To alleviate this problem, we introduce a novel weakly supervised labeling setting called Reduced Label. The proposed labeling setting not only avoids the decline of supervised information for the tail samples, but also decreases the labeling costs associated with long-tailed data. Additionally, we propose a straightforward and highly efficient unbiased framework with strong theoretical guarantees to learn from these Reduced Labels. Extensive experiments conducted on benchmark datasets including ImageNet validate the effectiveness of our approach, surpassing the performance of state-of-the-art weakly supervised methods.
https://arxiv.org/abs/2403.16469
Despite the advancements in deep learning for camera relocalization tasks, obtaining the ground truth pose labels required for the training process remains a costly endeavor. While current weakly supervised methods excel in lightweight label generation, their performance declines notably in scenarios with sparse views. In response to this challenge, we introduce WSCLoc, a system that can be customized to various deep learning-based relocalization models to enhance their performance under weakly supervised and sparse-view conditions. This is realized in two stages. In the initial stage, WSCLoc employs a multilayer perceptron-based structure called WFT-NeRF to co-optimize image reconstruction quality and initial pose information. To ensure a stable learning process, we incorporate temporal information as input. Furthermore, instead of optimizing over SE(3), we opt for $\mathfrak{sim}(3)$ optimization to explicitly enforce a scale constraint. In the second stage, we co-optimize the pre-trained WFT-NeRF and WFT-Pose. This optimization is enhanced by Time-Encoding based Random View Synthesis and supervised by inter-frame geometric constraints that consider pose, depth, and RGB information. We validate our approach on two publicly available datasets, one outdoor and one indoor. Our experimental results demonstrate that our weakly supervised relocalization solutions achieve superior pose estimation accuracy in sparse-view scenarios, comparable to state-of-the-art camera relocalization methods. We will make our code publicly available.
https://arxiv.org/abs/2403.15272
Deep learning enables the modelling of high-resolution histopathology whole-slide images (WSI). Weakly supervised learning of tile-level data is typically applied for tasks where labels only exist on the patient or WSI level (e.g. patient outcomes or histological grading). In this context, there is a need for improved spatial interpretability of predictions from such models. We propose a novel method, Wsi rEgion sElection aPproach (WEEP), for model interpretation. It provides a principled yet straightforward way to establish the spatial area of WSI required for assigning a particular prediction label. We demonstrate WEEP on a binary classification task in the area of breast cancer computational pathology. WEEP is easy to implement, is directly connected to the model-based decision process, and offers information relevant to both research and diagnostic applications.
https://arxiv.org/abs/2403.15238
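The region-selection idea in the WEEP abstract above can be sketched as: rank tiles by their model score and find the smallest top-ranked set whose aggregated score still yields the positive slide label. This is only an illustration of the principle; WEEP's actual procedure is tied to the trained model, and the function names, aggregation, and threshold here are assumptions:

```python
import numpy as np

def weep_region(tile_scores, predict, threshold=0.5):
    """Return indices of the smallest top-ranked tile set that still yields
    the positive slide label under `predict` (an aggregation function)."""
    order = np.argsort(tile_scores)[::-1]          # highest-scoring tiles first
    for k in range(1, len(order) + 1):
        subset = order[:k]
        if predict(tile_scores[subset]) >= threshold:
            return subset                          # minimal sufficient region
    return order                                   # whole slide needed
```

With mean aggregation and a lenient threshold, a single high-scoring tile may suffice; raising the threshold forces the selected region to grow, which is the interpretability signal the method exposes.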
3D instance segmentation (3DIS) is a crucial task, but point-level annotations are tedious in fully supervised settings. Thus, using bounding boxes (bboxes) as annotations has shown great potential. The current mainstream approach is a two-step process, involving the generation of pseudo-labels from box annotations and the training of a 3DIS network with the pseudo-labels. However, due to the presence of intersections among bboxes, not every point has a determined instance label, especially in overlapping areas. To generate higher quality pseudo-labels and achieve more precise weakly supervised 3DIS results, we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation (BSNet), which devises a novel pseudo-labeler called Simulation-assisted Transformer. The labeler consists of two main components. The first is Simulation-assisted Mean Teacher, which introduces Mean Teacher for the first time in this task and constructs simulated samples to assist the labeler in acquiring prior knowledge about overlapping areas. To better model local-global structure, we also propose Local-Global Aware Attention as the decoder for teacher and student labelers. Extensive experiments conducted on the ScanNetV2 and S3DIS datasets verify the superiority of our designs. Code is available at \href{this https URL}{this https URL}.
https://arxiv.org/abs/2403.15019
Addressing the challenge of high annotation costs when solving Math Word Problems (MWPs) through full supervision with intermediate equations, recent works have proposed weakly supervised task settings that rely solely on the final answer as a supervision signal. Existing leading approaches typically employ various search techniques to infer intermediate equations, but cannot ensure their semantic consistency with the natural language descriptions. The rise of Large Language Models (LLMs) like ChatGPT has opened up new possibilities for addressing MWPs directly. However, the computational demands of LLMs make them less than ideal in settings where resources are tight. In light of these challenges, we introduce an innovative two-stage framework that adeptly transfers mathematical expertise from large to tiny language models. In the \emph{Distillation Stage}, we propose a series of extraction processes that satisfy the properties of MWPs to distill mathematical knowledge from LLMs and construct the problem-equation pairs required for supervised training. In the \emph{Refinement Stage}, because the knowledge distillation method cannot guarantee full utilization of all data, we further exploit the unsuccessfully searched data through a Knowledge Refine method. Finally, we train a small model using the distilled data generated by the two-stage process. As our method fully leverages semantic understanding capabilities during the search for problem-equation pairs, it demonstrates significantly improved performance on the Math23K and Weak12K datasets compared to existing small-model methods, while maintaining a much lower computational cost than ChatGPT.
https://arxiv.org/abs/2403.14390
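The weak-supervision signal described above (only the final answer is known) naturally suggests an answer-checking filter for distilled problem-equation pairs: keep a candidate equation only if it evaluates to the gold answer. A minimal sketch under that assumption, not the paper's extraction pipeline; the helper names are illustrative:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def keep_pair(problem, candidate_equation, gold_answer, tol=1e-6):
    """Accept a distilled problem-equation pair only if the equation
    reproduces the weakly supervised signal (the final answer).
    `problem` is kept only for pairing/logging."""
    try:
        return abs(safe_eval(candidate_equation) - gold_answer) < tol
    except (ValueError, KeyError, ZeroDivisionError):
        return False
```

Pairs that fail this check would be the "unsuccessfully searched data" that the Refinement Stage tries to recover rather than discard.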
Vision-centric 3D environment understanding is both vital and challenging for autonomous driving systems. Recently, object-free methods have attracted considerable attention. Such methods perceive the world by predicting the semantics of discrete voxel grids but fail to construct continuous and accurate obstacle surfaces. To this end, in this paper, we propose SurroundSDF to implicitly predict the signed distance field (SDF) and semantic field for continuous perception from surround images. Specifically, we introduce a query-based approach and utilize an SDF constrained by the Eikonal formulation to accurately describe the surfaces of obstacles. Furthermore, considering the absence of precise SDF ground truth, we propose a novel weakly supervised paradigm for the SDF, referred to as the Sandwich Eikonal formulation, which emphasizes applying correct and dense constraints on both sides of the surface, thereby enhancing the perceptual accuracy of the surface. Experiments suggest that our method achieves SOTA performance on both occupancy prediction and 3D scene reconstruction tasks on the nuScenes dataset.
https://arxiv.org/abs/2403.14366
Recently, One-stage Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has gained increasing interest due to its simplification over the cumbersome multi-stage counterpart. Limited by the inherent ambiguity of the Class Activation Map (CAM), we observe that one-stage pipelines often encounter confirmation bias caused by incorrect CAM pseudo-labels, impairing their final segmentation performance. Although recent works discard many unreliable pseudo-labels to implicitly alleviate this issue, they fail to exploit sufficient supervision for their models. To this end, we propose a dual student framework with trustworthy progressive learning (DuPL). Specifically, we propose a dual student network with a discrepancy loss to yield diverse CAMs for each sub-net. The two sub-nets generate supervision for each other, mitigating the confirmation bias caused by learning their own incorrect pseudo-labels. In this process, we progressively introduce more trustworthy pseudo-labels into the supervision through dynamic threshold adjustment with an adaptive noise filtering strategy. Moreover, we believe that every pixel, even one discarded from supervision due to its unreliability, is important for WSSS. Thus, we develop consistency regularization on these discarded regions, providing supervision for every pixel. Experimental results demonstrate the superiority of the proposed DuPL over recent state-of-the-art alternatives on the PASCAL VOC 2012 and MS COCO datasets. Code is available at this https URL.
https://arxiv.org/abs/2403.11184
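The dynamic-threshold idea in the DuPL abstract above (progressively admitting more pseudo-labels while marking the rest as "ignore") can be sketched as follows. This is an illustrative toy, not the paper's adaptive noise-filtering strategy; the schedule and parameter names are assumptions:

```python
import numpy as np

def filter_pseudo_labels(cam, base_thresh=0.7, keep_ratio=0.6):
    """Dynamic-threshold sketch: lower the cutoff until at least `keep_ratio`
    of pixels carry a pseudo-label; the rest are marked ignore (-1).

    cam: array of per-pixel foreground confidences in [0, 1].
    """
    thresh = base_thresh
    while (cam >= thresh).mean() < keep_ratio and thresh > 0.05:
        thresh -= 0.05                      # progressively trust more pixels
    labels = np.where(cam >= thresh, 1, -1)
    return labels, thresh
```

Pixels labeled `-1` are excluded from the classification loss; in DuPL they would still receive consistency regularization rather than being wasted entirely.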
Exploring and mining subtle yet distinctive features between sub-categories with similar appearances is crucial for fine-grained visual categorization (FGVC). However, less effort has been devoted to assessing the quality of extracted visual representations. Intuitively, the network may struggle to capture discriminative features from low-quality samples, which leads to a significant decline in FGVC performance. To tackle this challenge, we propose a weakly supervised Context-Semantic Quality Awareness Network (CSQA-Net) for FGVC. In this network, to model the spatial contextual relationship between rich part descriptors and global semantics for capturing more discriminative details within the object, we design a novel multi-part and multi-scale cross-attention (MPMSCA) module. Before feeding to the MPMSCA module, the part navigator is developed to address the scale confusion problems and accurately identify the local distinctive regions. Furthermore, we propose a generic multi-level semantic quality evaluation module (MLSQE) to progressively supervise and enhance hierarchical semantics from different levels of the backbone network. Finally, context-aware features from MPMSCA and semantically enhanced features from MLSQE are fed into the corresponding quality probing classifiers to evaluate their quality in real-time, thus boosting the discriminability of feature representations. Comprehensive experiments on four popular and highly competitive FGVC datasets demonstrate the superiority of the proposed CSQA-Net in comparison with the state-of-the-art methods.
https://arxiv.org/abs/2403.10298
Weakly supervised surgical instrument segmentation with only instrument presence labels has rarely been explored in the surgical domain. To mitigate the highly under-constrained challenge, we extend a two-stage weakly supervised segmentation paradigm with temporal attributes from two perspectives. From a temporal equivariance perspective, we propose a prototype-based temporal equivariance regulation loss to enhance pixel-wise consistency between adjacent features. From a semantic continuity perspective, we propose a class-aware temporal semantic continuity loss to constrain the semantic consistency between a global view of the target frame and local non-discriminative regions of an adjacent reference frame. To the best of our knowledge, WeakSurg is the first instrument-presence-only weakly supervised segmentation architecture to take temporal information into account for surgical scenarios. Extensive experiments are conducted on Cholec80, an open benchmark for phase and instrument recognition. We annotate instance-wise instrument labels at fixed time-steps, double-checked by a clinician with 3 years of experience. Our results show that WeakSurg compares favorably with state-of-the-art methods not only on semantic segmentation metrics but also on instance segmentation metrics.
https://arxiv.org/abs/2403.09551
Recent weakly supervised semantic segmentation (WSSS) methods strive to incorporate contextual knowledge to improve the completeness of class activation maps (CAM). In this work, we argue that the knowledge bias between instances and contexts affects the capability of the prototype to sufficiently understand instance semantics. Inspired by prototype learning theory, we propose leveraging prototype awareness to capture diverse and fine-grained feature attributes of instances. The hypothesis is that contextual prototypes might erroneously activate similar and frequently co-occurring object categories due to this knowledge bias. Therefore, we propose to enhance the prototype representation ability by mitigating the bias to better capture spatial coverage in semantic object regions. With this goal, we present a Context Prototype-Aware Learning (CPAL) strategy, which leverages semantic context to enrich instance comprehension. The core of this method is to accurately capture intra-class variations in object features through context-aware prototypes, facilitating the adaptation to the semantic attributes of various instances. We design feature distribution alignment to optimize prototype awareness, aligning instance feature distributions with dense features. In addition, a unified training framework is proposed to combine label-guided classification supervision and prototypes-guided self-supervision. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show that CPAL significantly improves off-the-shelf methods and achieves state-of-the-art performance. The project is available at this https URL.
https://arxiv.org/abs/2403.07630
Recently, convolutional neural networks (CNNs) with large kernels have attracted much attention in the computer vision field, following the success of Vision Transformers. Large kernel CNNs have been reported to perform well in downstream vision tasks as well as in classification. The high performance of large kernel CNNs in downstream tasks has been attributed to the large effective receptive field (ERF) produced by large kernels, but this view has not been fully tested. We therefore revisit the performance of large kernel CNNs in downstream tasks, focusing on the weakly supervised object localization (WSOL) task. WSOL, a difficult downstream task that is not fully supervised, provides a new angle from which to explore the capabilities of large kernel CNNs. Our study compares the modern large kernel CNNs ConvNeXt, RepLKNet, and SLaK to test the validity of the naive expectation that ERF size is important for improving downstream task performance. Our analysis of the factors contributing to high performance provides a different perspective, in which the main factor is feature map improvement. Furthermore, we find that modern CNNs are robust to the CAM problem of only local regions of objects being activated, which has long been discussed in WSOL. CAM is the most classic WSOL method, but because of the above-mentioned problem, it is often used only as a baseline for comparison. However, experiments on the CUB-200-2011 dataset show that simply combining a large kernel CNN, CAM, and simple data augmentation can achieve performance (90.99% MaxBoxAcc) comparable to the latest CNN-based WSOL methods, which require special training or complex post-processing. The code is available at this https URL.
https://arxiv.org/abs/2403.06676
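The classic CAM referenced above is simply a class-weighted sum of the final convolutional feature maps, using the weights of the global-average-pooling classifier. A minimal sketch of that standard formulation (the normalization step is a common convention, not specific to this paper):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Classic CAM: weighted sum of the final conv feature maps.

    feature_maps: (K, H, W) activations before global average pooling
    fc_weights:   (num_classes, K) classifier weights
    """
    w = fc_weights[class_idx]                       # (K,) weights for the class
    cam = np.tensordot(w, feature_maps, axes=1)     # (H, W) localization map
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                            # normalize to [0, 1]
    return cam
```

A bounding box is then typically derived by thresholding this map, which is where the "only local regions activated" problem discussed in the abstract shows up for small-kernel backbones.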
In recent years, video anomaly detection has been extensively investigated in both unsupervised and weakly supervised settings to alleviate costly temporal labeling. Despite significant progress, these methods still suffer from unsatisfactory results such as numerous false alarms, primarily due to the absence of precise temporal anomaly annotation. In this paper, we present a novel labeling paradigm, termed "glance annotation", to achieve a better balance between anomaly detection accuracy and annotation cost. Specifically, a glance annotation is a random frame within each abnormal event, which can be easily accessed and is cost-effective. To assess its effectiveness, we manually provide glance annotations for two standard video anomaly detection datasets: UCF-Crime and XD-Violence. Additionally, we propose a customized GlanceVAD method that leverages Gaussian kernels as the basic unit to compose the temporal anomaly distribution, enabling the learning of diverse and robust anomaly representations from the glance annotations. Through comprehensive analysis and experiments, we verify that the proposed labeling paradigm can achieve an excellent trade-off between annotation cost and model performance. Extensive experimental results also demonstrate the effectiveness of our GlanceVAD approach, which significantly outperforms existing advanced unsupervised and weakly supervised methods. Code and annotations will be publicly available at this https URL.
https://arxiv.org/abs/2403.06154
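The Gaussian-kernel composition described in the GlanceVAD abstract above can be sketched as: place one Gaussian per glance-annotated frame and sum them into a per-frame anomaly distribution. This is a toy illustration, not the paper's training procedure; the clipping and the `sigma` value are assumptions:

```python
import numpy as np

def glance_anomaly_curve(n_frames, glance_frames, sigma=8.0):
    """Compose a per-frame anomaly distribution as a sum of Gaussian kernels
    centred on the glance-annotated frames, clipped to [0, 1]."""
    t = np.arange(n_frames)
    curve = np.zeros(n_frames)
    for g in glance_frames:
        curve += np.exp(-0.5 * ((t - g) / sigma) ** 2)  # kernel peak = 1 at g
    return np.clip(curve, 0.0, 1.0)
```

The resulting soft curve can then supervise frame-level anomaly scores, spreading the single glance frame's signal over the surrounding event instead of treating one frame as the only positive.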
Change detection, which aims to detect spatial changes from a pair of multi-temporal images due to natural or man-made causes, has been widely applied in remote sensing, disaster management, urban management, etc. Most existing change detection approaches, however, are fully supervised and require labor-intensive pixel-level labels. To address this, we develop a novel weakly supervised change detection technique via Knowledge Distillation and Multiscale Sigmoid Inference (KD-MSI) that leverages image-level labels. In our approach, the Class Activation Maps (CAM) are utilized not only to derive a change probability map but also to serve as a foundation for the knowledge distillation process. This is done through a joint training strategy of the teacher and student networks, enabling the student network to highlight potential change areas more accurately than teacher network based on image-level labels. Moreover, we designed a Multiscale Sigmoid Inference (MSI) module as a post processing step to further refine the change probability map from the trained student network. Empirical results on three public datasets, i.e., WHU-CD, DSIFN-CD, and LEVIR-CD, demonstrate that our proposed technique, with its integrated training strategy, significantly outperforms the state-of-the-art.
https://arxiv.org/abs/2403.05796
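The post-processing step named above, Multiscale Sigmoid Inference, is not specified in detail in the abstract; one plausible reading is to smooth the CAM at several spatial scales, pass each through a sigmoid, and average. The sketch below implements that reading only, with illustrative names and a naive box filter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def box_blur(img, k):
    """Mean filter with window k, using edge-replicated 'same' padding."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def multiscale_sigmoid(cam_logits, kernels=(1, 3, 5)):
    """Average sigmoid responses of the CAM smoothed at several scales."""
    maps = [sigmoid(box_blur(cam_logits, k)) for k in kernels]
    return np.mean(maps, axis=0)
```

Averaging across scales suppresses isolated high responses while keeping spatially coherent change regions, which matches the refinement role the abstract assigns to MSI.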
Artificial Intelligence (AI) has great potential to improve health outcomes by training systems on vast digitized clinical datasets. Computational Pathology, with its massive amounts of microscopy image data and impact on diagnostics and biomarkers, is at the forefront of this development. Gigapixel pathology slides pose a unique challenge due to their enormous size and are usually divided into tens of thousands of smaller tiles for analysis. This results in a discontinuity in the machine learning process by separating the training of tile-level encoders from slide-level aggregators and the need to adopt weakly supervised learning strategies. Training models from entire pathology slides end-to-end has been largely unexplored due to its computational challenges. To overcome this problem, we propose a novel approach to jointly train both a tile encoder and a slide-aggregator fully in memory and end-to-end at high-resolution, bridging the gap between input and slide-level supervision. While more computationally expensive, detailed quantitative validation shows promise for large-scale pre-training of pathology foundation models.
https://arxiv.org/abs/2403.04865
Deep Learning models have been successfully utilized to extract clinically actionable insights from routinely available histology data. Generally, these models require annotations performed by clinicians, which are scarce and costly to generate. The emergence of self-supervised learning (SSL) methods removes this barrier, allowing for large-scale analyses on non-annotated data. However, recent SSL approaches apply increasingly expansive model architectures and larger datasets, causing the rapid escalation of data volumes, hardware prerequisites, and overall expenses, limiting access to these resources to a few institutions. Therefore, we investigated the complexity of contrastive SSL in computational pathology in relation to classification performance when using consumer-grade hardware. Specifically, we analyzed the effects of adaptations in data volume, architecture, and algorithms on downstream classification tasks, emphasizing their impact on computational resources. We trained breast cancer foundation models on a large public patient cohort and validated them on various downstream classification tasks in a weakly supervised manner on two external public patient cohorts. Our experiments demonstrate that we can improve downstream classification performance whilst reducing SSL training duration by 90%. In summary, we propose a set of adaptations which enable the utilization of SSL in computational pathology in non-resource-abundant environments.
https://arxiv.org/abs/2403.04558