Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging knowledge from large-scale pre-trained models. Most existing methods adopt a two-stage pipeline: weakly supervised pre-training on image captions followed by supervised fine-tuning (SFT) on fully annotated scene graphs. However, they omit explicit modeling of interacting objects and treat all objects equally, resulting in mismatched relation pairs. To address this, we propose INOVA, an interaction-aware OVSGG framework. During pre-training, INOVA employs an interaction-aware target generation strategy to distinguish interacting objects from non-interacting ones. During SFT, INOVA applies an interaction-guided query selection tactic that prioritizes interacting objects during bipartite graph matching. In addition, INOVA is equipped with interaction-consistent knowledge distillation, which enhances robustness by pushing interacting object pairs away from the background. Extensive experiments on two benchmarks (VG and GQA) show that INOVA achieves state-of-the-art performance, demonstrating the potential of interaction-aware mechanisms for real-world applications.
https://arxiv.org/abs/2502.03856
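For intuition, here is a minimal PyTorch sketch of what an interaction-consistent distillation term of this flavor could look like: interacting pair embeddings are pushed away from background embeddings in cosine-similarity space. The margin, the tensor shapes, and the interacting/background split are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def interaction_consistency_loss(pair_emb, bg_emb, margin=0.3):
    """Push interacting subject-object pair embeddings away from background
    embeddings by at least `margin` in cosine-similarity space.

    pair_emb: (P, D) embeddings of predicted interacting pairs
    bg_emb:   (B, D) embeddings of background (non-interacting) regions
    """
    pair_emb = F.normalize(pair_emb, dim=-1)
    bg_emb = F.normalize(bg_emb, dim=-1)
    sim = pair_emb @ bg_emb.t()                     # (P, B) cosine similarities
    # Penalise any interacting pair that is still too similar to background.
    return F.relu(sim - (1.0 - margin)).mean()

# Toy usage with random embeddings.
loss = interaction_consistency_loss(torch.randn(8, 256), torch.randn(32, 256))
```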
Recent advances in speech recognition have been driven by large-scale datasets and attention-based architectures, but many challenges remain, especially for low-resource languages and dialects. This paper explores integrating weakly supervised transcripts from TV subtitles into automatic speech recognition (ASR) systems, aiming to improve both verbatim transcriptions and automatically generated subtitles. To this end, verbatim data and subtitles are treated as different domains or languages, owing to their distinct characteristics. We propose and compare several end-to-end architectures designed to jointly model both modalities with separate or shared encoders and decoders. The proposed methods can jointly generate a verbatim transcription and a subtitle. Evaluation on Flemish (Belgian Dutch) demonstrates that a model with cascaded encoders and separate decoders represents the differences between the two data types most efficiently while improving on both domains. Despite domain differences and linguistic variation, combining verbatim transcripts with subtitle data leads to notable ASR improvements without the need for extensive preprocessing. Additionally, experiments with a large-scale subtitle dataset show the scalability of the proposed approach. The methods not only improve ASR accuracy but also generate subtitles that closely match standard written text, offering several potential applications.
https://arxiv.org/abs/2502.03212
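A rough PyTorch skeleton of the cascaded-encoder, separate-decoder layout described above, with a shared acoustic encoder, a small subtitle encoder cascaded on top, and one decoder per output type. Layer counts, dimensions, and the omission of causal masks are simplifying assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CascadedDualDecoderASR(nn.Module):
    """Shared acoustic encoder, a cascaded subtitle encoder, and two decoders:
    one for verbatim transcripts, one for subtitles."""

    def __init__(self, vocab=5000, d_model=256, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.subtitle_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # cascaded on top
        self.verbatim_decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.subtitle_decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats, verbatim_in, subtitle_in):
        # feats: (B, T, d_model) acoustic features; *_in: (B, L) token ids.
        # Causal target masks are omitted for brevity.
        h_verbatim = self.shared_encoder(feats)
        h_subtitle = self.subtitle_encoder(h_verbatim)            # cascaded encoder
        v = self.verbatim_decoder(self.embed(verbatim_in), h_verbatim)
        s = self.subtitle_decoder(self.embed(subtitle_in), h_subtitle)
        return self.out(v), self.out(s)                           # two joint outputs

model = CascadedDualDecoderASR()
logits_v, logits_s = model(torch.randn(2, 100, 256),
                           torch.randint(0, 5000, (2, 20)),
                           torch.randint(0, 5000, (2, 20)))
```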
In industrial settings, weakly supervised (WS) methods are usually preferred over their fully supervised (FS) counterparts because they do not require costly manual annotations. Unfortunately, the segmentation masks obtained in the WS regime are typically poor in terms of accuracy. In this work, we present a WS method capable of producing accurate masks for semantic segmentation of video streams. More specifically, we build saliency maps that exploit the temporal coherence between consecutive frames of a video, promoting consistency when objects appear in different frames. We apply our method in a waste-sorting scenario, where we perform weakly supervised video segmentation (WSVS) by training an auxiliary classifier that distinguishes between videos recorded before and after a human operator manually removes specific waste items from a conveyor belt. The saliency maps of this classifier identify the materials to be removed, and we modify the classifier training to minimize the differences between the saliency map of a central frame and those of adjacent frames, after compensating for object displacement. Experiments on a real-world dataset demonstrate the benefits of integrating temporal coherence directly during the training phase of the classifier. Code and dataset are available upon request.
https://arxiv.org/abs/2502.01455
3D visual grounding (3DVG) is challenging because it requires understanding visual information, language, and spatial relationships. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high cost of 3D vision-language datasets. On the other hand, LLM/VLM-based agents have been proposed for 3DVG, eliminating the need for training data. However, these methods incur prohibitive time and token costs during inference. To address these challenges, we introduce a novel training-free symbolic framework for 3D visual grounding, the Evolvable Symbolic Visual Grounder (EaSe), which offers significantly reduced inference costs compared to previous agent-based methods while maintaining comparable performance. EaSe uses LLM-generated code to compute spatial relationships. It also implements an automatic pipeline to evaluate and optimize the quality of this code and integrates VLMs to assist in the grounding process. Experimental results demonstrate that EaSe achieves 52.9% accuracy on the Nr3D dataset and 49.2% Acc@0.25 on ScanRefer, which is top-tier among training-free methods. Moreover, it substantially reduces inference time and cost, offering a balanced trade-off between performance and efficiency. Codes are available at this https URL.
https://arxiv.org/abs/2502.01401
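To make the idea concrete, below is the kind of relation code an LLM might be asked to generate and that an automatic pipeline could then score on a few labeled pairs. The box format, the `above` definition, and the accuracy check are hypothetical illustrations, not EaSe's actual prompts or code.

```python
import numpy as np

def above(anchor_box, candidate_box):
    """Example of an LLM-generated relation function: score how strongly
    `candidate_box` is above `anchor_box`. Boxes are (cx, cy, cz, dx, dy, dz),
    with z pointing up. Returns a value in [0, 1]."""
    gap = (candidate_box[2] - candidate_box[5] / 2) - (anchor_box[2] + anchor_box[5] / 2)
    overlap_x = max(0.0, min(anchor_box[0] + anchor_box[3] / 2, candidate_box[0] + candidate_box[3] / 2)
                    - max(anchor_box[0] - anchor_box[3] / 2, candidate_box[0] - candidate_box[3] / 2))
    overlap_y = max(0.0, min(anchor_box[1] + anchor_box[4] / 2, candidate_box[1] + candidate_box[4] / 2)
                    - max(anchor_box[1] - anchor_box[4] / 2, candidate_box[1] - candidate_box[4] / 2))
    return float(gap > 0) * min(1.0, overlap_x * overlap_y / (candidate_box[3] * candidate_box[4] + 1e-6))

def evaluate_relation_code(relation_fn, labeled_pairs):
    """Score a generated relation function against a handful of labeled pairs
    (anchor, candidate, is_positive); low scores would trigger regeneration."""
    preds = [relation_fn(a, c) > 0.5 for a, c, _ in labeled_pairs]
    return np.mean([p == y for p, (_, _, y) in zip(preds, labeled_pairs)])

# Toy check: a box well above the anchor with full x/y overlap scores near 1.
anchor, cand = [0, 0, 0, 1, 1, 1], [0, 0, 1.5, 1, 1, 1]
print(above(anchor, cand))  # -> ~1.0
```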
Whole Slide Imaging (WSI), which involves high-resolution digital scans of pathology slides, has become the gold standard for cancer diagnosis, but its gigapixel resolution and the scarcity of annotated datasets present challenges for deep learning models. Multiple Instance Learning (MIL), a widely-used weakly supervised approach, bypasses the need for patch-level annotations. However, conventional MIL methods overlook the spatial relationships between patches, which are crucial for tasks such as cancer grading and diagnosis. To address this, graph-based approaches have gained prominence by incorporating spatial information through node connections. Despite their potential, both MIL and graph-based models are vulnerable to learning spurious associations, like color variations in WSIs, affecting their robustness. In this dissertation, we conduct an extensive comparison of multiple graph construction techniques, MIL models, graph-MIL approaches, and interventional training, introducing a new framework, Graph-based Multiple Instance Learning with Interventional Training (GMIL-IT), for WSI classification. We evaluate their impact on model generalization through domain shift analysis and demonstrate that graph-based models alone achieve the generalization initially anticipated from interventional training. Our code is available here: this http URL
https://arxiv.org/abs/2501.19048
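A plain-PyTorch sketch of the general recipe compared in the dissertation: a spatial k-NN graph over patch centroids, one mean-aggregation message-passing step, and gated-attention MIL pooling to a slide-level prediction. The dimensions, k, and the aggregation rule are illustrative assumptions rather than GMIL-IT's exact models.

```python
import torch
import torch.nn as nn

def knn_adjacency(coords, k=8):
    """Spatial k-NN graph over WSI patches. coords: (N, 2) patch centroids."""
    d = torch.cdist(coords, coords)                       # (N, N) pairwise distances
    idx = d.topk(k + 1, largest=False).indices[:, 1:]     # skip self
    adj = torch.zeros_like(d)
    adj.scatter_(1, idx, 1.0)
    return ((adj + adj.t()) > 0).float()                  # symmetrise

class GraphAttnMIL(nn.Module):
    """One mean-aggregation message-passing step followed by gated-attention
    MIL pooling (ABMIL-style) to a slide-level prediction."""
    def __init__(self, in_dim=512, hid=128, n_classes=2):
        super().__init__()
        self.gnn = nn.Linear(in_dim, hid)
        self.attn_v = nn.Linear(hid, hid)
        self.attn_u = nn.Linear(hid, hid)
        self.attn_w = nn.Linear(hid, 1)
        self.cls = nn.Linear(hid, n_classes)

    def forward(self, feats, adj):
        deg = adj.sum(1, keepdim=True).clamp(min=1)
        h = torch.relu(self.gnn((adj @ feats) / deg))             # neighbourhood mean
        a = torch.softmax(self.attn_w(torch.tanh(self.attn_v(h)) *
                                      torch.sigmoid(self.attn_u(h))), dim=0)
        return self.cls((a * h).sum(0, keepdim=True))             # (1, n_classes)

coords, feats = torch.rand(200, 2), torch.randn(200, 512)
logits = GraphAttnMIL()(feats, knn_adjacency(coords))
```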
In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a multimodal task that aims to localize specific subjects spatio-temporally based on textual queries, without bounding-box supervision. Motivated by recent advances in multimodal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.
https://arxiv.org/abs/2501.17053
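As a rough illustration of the self-paced idea behind SPS, here is a generic hard self-paced weighting scheme in PyTorch: samples whose current loss falls below a growing threshold are admitted first, so training moves from easy to hard cases. The linear schedule and threshold values are assumptions, not CoSPaL's actual curriculum.

```python
import torch

def self_paced_weights(losses, epoch, total_epochs, lam_start=0.5, lam_end=5.0):
    """Classic hard self-paced weighting: a sample contributes only if its
    current loss is below a threshold that grows linearly over training,
    so the model sees easy (coarse) cases first and hard ones later."""
    lam = lam_start + (lam_end - lam_start) * epoch / max(1, total_epochs - 1)
    return (losses.detach() < lam).float()

# Toy usage inside a training step.
per_sample_losses = torch.rand(16) * 4.0          # stand-in for grounding losses
w = self_paced_weights(per_sample_losses, epoch=3, total_epochs=20)
loss = (w * per_sample_losses).sum() / w.sum().clamp(min=1)
```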
Weakly supervised landslide extraction aims to identify landslide regions from remote sensing data using models trained with weak labels, particularly image-level labels. However, it is often challenged by the imprecise boundaries of the extracted objects due to the lack of pixel-wise supervision and the properties of landslide objects. To tackle these issues, we propose a simple yet effective method by auto-prompting the Segment Anything Model (SAM), i.e., APSAM. Instead of depending on high-quality class activation maps (CAMs) for pseudo-labeling or fine-tuning SAM, our method directly yields fine-grained segmentation masks from SAM inference through prompt engineering. Specifically, it adaptively generates hybrid prompts from the CAMs obtained by an object localization network. To provide sufficient information for SAM prompting, an adaptive prompt generation (APG) algorithm is designed to fully leverage the visual patterns of CAMs, enabling the efficient generation of pseudo-masks for landslide extraction. These informative prompts are able to identify the extent of landslide areas (box prompts) and denote the centers of landslide objects (point prompts), guiding SAM in landslide segmentation. Experimental results on high-resolution aerial and satellite datasets demonstrate the effectiveness of our method, achieving improvements of at least 3.0% in F1 score and 3.69% in IoU compared to other state-of-the-art methods. The source codes and datasets will be available at this https URL.
https://arxiv.org/abs/2501.13426
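A small sketch of how hybrid prompts could be derived from a CAM in the spirit of APG: threshold the map, take each connected high-activation blob, and emit its bounding box plus its activation peak as box and point prompts. The threshold, minimum size, and the NumPy/SciPy implementation are assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy import ndimage

def cam_to_prompts(cam, threshold=0.45, min_area=25):
    """Turn a class activation map into SAM-style hybrid prompts:
    one bounding box and one peak point per connected high-activation blob.

    cam: (H, W) array in [0, 1]. Returns (boxes [x0, y0, x1, y1], points [x, y]).
    """
    mask = cam >= threshold
    labels, n = ndimage.label(mask)                        # connected components
    boxes, points = [], []
    for region in ndimage.find_objects(labels):
        if region is None:
            continue
        ys, xs = region
        if (ys.stop - ys.start) * (xs.stop - xs.start) < min_area:
            continue                                       # skip tiny boxes
        sub = cam[ys, xs]
        py, px = np.unravel_index(np.argmax(sub), sub.shape)   # activation peak
        boxes.append([xs.start, ys.start, xs.stop - 1, ys.stop - 1])
        points.append([xs.start + px, ys.start + py])
    return np.array(boxes), np.array(points)

boxes, points = cam_to_prompts(np.random.rand(128, 128))
```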
Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.
https://arxiv.org/abs/2501.12632
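A minimal sketch of the patch-text matching and MIL aggregation described above: backbone patch features (assumed already distilled into the CLIP text space) are scored against class text embeddings, and top-k pooling turns patch scores into an image-level prediction. The shapes, temperature, and pooling choice are assumptions.

```python
import torch
import torch.nn.functional as F

def patch_text_scores(patch_feats, text_embs, temperature=0.07):
    """Cosine similarity between backbone patch features distilled into the
    CLIP text space and per-class text embeddings.

    patch_feats: (B, N, D) patch tokens, text_embs: (C, D) class embeddings.
    Returns patch-level class maps of shape (B, N, C)."""
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    return p @ t.t() / temperature

def mil_image_logits(patch_logits, k=8):
    """Top-k multiple-instance pooling: an image's class score is the mean of
    its k highest-scoring patches, so localization and classification share
    one model."""
    topk = patch_logits.topk(k, dim=1).values             # (B, k, C)
    return topk.mean(dim=1)                               # (B, C)

maps = patch_text_scores(torch.randn(2, 196, 512), torch.randn(200, 512))
logits = mil_image_logits(maps)
```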
Pseudo-label learning methods have been widely applied to weakly supervised temporal action localization. Existing works directly use a weakly supervised base model to generate instance-level pseudo-labels for training a fully supervised detection head. We argue that the noise in these pseudo-labels interferes with the learning of the fully supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To address these issues, we introduce a two-stage noisy-label learning strategy that harnesses every potentially useful signal in the noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing-instance compensation module and an ambiguous-instance correction module to solve the short-action-missing and many-to-one problems. In addition, we apply a high-quality pseudo-label mining loss in this framework to assign different weights to the noisy labels and train more effectively. Our model greatly outperforms the previous state-of-the-art method in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
https://arxiv.org/abs/2501.11124
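One plausible way to realize a high-quality pseudo-label mining loss is to weight each pseudo-labeled snippet by the teacher's confidence, as sketched below; the quadratic down-weighting and threshold are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mined_pseudo_label_loss(student_logits, pseudo_labels, teacher_conf, tau=0.7):
    """Confidence-weighted pseudo-label loss: snippets whose teacher confidence
    exceeds `tau` get full weight, the rest are down-weighted quadratically,
    so noisy labels contribute less to the detection head.

    student_logits: (N, C), pseudo_labels: (N,) int64, teacher_conf: (N,) in [0, 1].
    """
    ce = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    w = torch.where(teacher_conf >= tau,
                    torch.ones_like(teacher_conf),
                    (teacher_conf / tau) ** 2)
    return (w * ce).sum() / w.sum().clamp(min=1e-6)

loss = mined_pseudo_label_loss(torch.randn(32, 20),
                               torch.randint(0, 20, (32,)),
                               torch.rand(32))
```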
Weakly supervised violence detection refers to training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach: we leverage the inherent discrepancies across modalities in violence-event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature-matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, which achieves an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at this https URL.
https://arxiv.org/abs/2501.07496
ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services: Non-Standard Word (NSW) Lookup and Lexical Normalization, enabling users to retrieve standard forms of informal language and standardize text containing NSWs. ViSoLex's architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system's design, functionality, and its applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system's capabilities for additional languages and improving the handling of more complex non-standard linguistic patterns.
https://arxiv.org/abs/2501.07020
Weakly supervised segmentation has the potential to greatly reduce the annotation effort for training segmentation models for small structures such as hyper-reflective foci (HRF) in optical coherence tomography (OCT). However, most weakly supervised methods either involve a strong downsampling of input images or only achieve localization at a coarse resolution, both of which are unsatisfactory for small structures. We propose a novel framework that increases the spatial resolution of a traditional attention-based Multiple Instance Learning (MIL) approach by using Layer-wise Relevance Propagation (LRP) to prompt the Segment Anything Model (SAM 2), and increases recall with iterative inference. Moreover, we demonstrate that replacing MIL with a Compact Convolutional Transformer (CCT), which adds a positional encoding and permits an exchange of information between different regions of the OCT image, leads to a further and substantial increase in segmentation accuracy.
https://arxiv.org/abs/2501.05933
Salient Object Detection (SOD) aims to identify and segment prominent regions within a scene. Traditional models rely on manually annotated pseudo labels with precise pixel-level accuracy, which is time-consuming. To address these challenges, we develop a low-cost, high-precision annotation method that leverages large foundation models. Specifically, we use a weakly supervised approach to guide large models in generating pseudo-labels through textual prompts. Since large models do not effectively focus on the salient regions of images, we manually annotate a subset of text to fine-tune the model. Based on this approach, which enables precise and rapid pseudo-label generation, we introduce a new dataset, BDS-TR. Compared to the previous DUTS-TR dataset, BDS-TR is substantially larger in scale and encompasses a wider variety of categories and scenes. This expansion enhances our model's applicability across a broader range of scenarios and provides a more comprehensive foundational dataset for future SOD research. Additionally, we present an edge decoder based on dynamic upsampling that focuses on object edges while gradually recovering image feature resolution. Comprehensive experiments on five benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches and also surpasses several existing fully supervised SOD methods. The code and results will be made available.
https://arxiv.org/abs/2501.04582
Rotated object detection has made significant progress in optical remote sensing. However, advances in the Synthetic Aperture Radar (SAR) field lag behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, existing weakly supervised models exhibit limited accuracy in predicting the object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and show that they share the same shortcoming: they overlook the unit-circle constraint inherent in these encodings, easily leading to prediction biases. To address this issue, we propose the Unit Cycle Resolver (UCR), which incorporates a unit-circle constraint loss to improve angle prediction accuracy. Our approach effectively improves the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., the DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that UCR enhances angle prediction accuracy. Our dataset and code can be found at: this https URL.
https://arxiv.org/abs/2501.04440
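The core constraint is easy to state in code: the predicted (cos θ, sin θ) pair must lie on the unit circle, and the angle is recovered with atan2. A minimal PyTorch sketch follows; the squared penalty and the way it is combined with an angle regression term are illustrative assumptions.

```python
import torch

def unit_circle_loss(cos_pred, sin_pred):
    """Penalise angle encodings that drift off the unit circle:
    the pair (cos θ, sin θ) must satisfy cos²θ + sin²θ = 1."""
    return ((cos_pred ** 2 + sin_pred ** 2 - 1.0) ** 2).mean()

def decode_angle(cos_pred, sin_pred):
    """Recover the rotation angle from the (cos, sin) encoding."""
    return torch.atan2(sin_pred, cos_pred)

# Toy usage: regression targets for a rotated box with angle theta.
theta = torch.tensor([0.4])
cos_p, sin_p = torch.tensor([0.9]), torch.tensor([0.5])   # imperfect predictions
total = unit_circle_loss(cos_p, sin_p) + torch.abs(decode_angle(cos_p, sin_p) - theta).mean()
```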
With the rapid advancement of deep learning, computational pathology has made significant progress in cancer diagnosis and subtyping. Tissue segmentation is a core challenge, essential for prognosis and treatment decisions. Weakly supervised semantic segmentation (WSSS) reduces the annotation requirement by using image-level labels instead of pixel-level ones. However, Class Activation Map (CAM)-based methods still suffer from low spatial resolution and unclear boundaries. To address these issues, we propose a multi-level superpixel correction algorithm that refines CAM boundaries using superpixel clustering and flood fill. Experimental results show that our method achieves strong performance on a breast cancer segmentation dataset, reaching an mIoU of 71.08% and significantly improving tumor microenvironment boundary delineation.
https://arxiv.org/abs/2501.03891
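A single-level sketch of the superpixel correction idea using scikit-image SLIC: the CAM is averaged inside each superpixel so the refined mask snaps to superpixel boundaries. The number of segments and the threshold are assumptions, and the paper's multi-level scheme and flood-fill step are omitted here.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_refine_cam(image, cam, n_segments=300, threshold=0.5):
    """Single-level superpixel correction: average the CAM inside each SLIC
    superpixel so the refined mask follows superpixel (tissue) boundaries.

    image: (H, W, 3) RGB float array, cam: (H, W) in [0, 1].
    Returns a binary mask of the same size.
    """
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    refined = np.zeros_like(cam)
    for s in np.unique(segments):
        region = segments == s
        refined[region] = cam[region].mean()      # one score per superpixel
    return refined >= threshold

# Toy usage with a random image and CAM.
mask = superpixel_refine_cam(np.random.rand(128, 128, 3), np.random.rand(128, 128))
```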
Graph classification plays a pivotal role in various domains, including pathology, where images can be represented as graphs: nodes might represent individual nuclei, and edges capture the spatial or functional relationships between them. Often, the overall label of the graph, such as a cancer type or disease state, is determined by patterns within smaller, localized regions of the image. This work introduces a weakly supervised graph classification framework leveraging two subgraph extraction techniques: (1) a sliding-window approach and (2) a BFS-based approach. Subgraphs are processed using a Graph Attention Network (GAT), which employs attention mechanisms to identify the most informative subgraphs for classification. Weak supervision is achieved by propagating graph-level labels to subgraphs, eliminating the need for detailed subgraph annotations.
https://arxiv.org/abs/2501.02021
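A small NetworkX sketch of the BFS-based extraction: one subgraph per seed node containing everything within a fixed hop distance, with each subgraph inheriting the graph-level label. The hop depth, node cap, and the downstream GAT step are not shown, and the parameters are assumptions.

```python
import networkx as nx

def bfs_subgraphs(graph, depth=2, max_nodes=50):
    """Extract one BFS-rooted subgraph per seed node: all nodes within `depth`
    hops of the seed (capped at `max_nodes`). Each subgraph inherits the
    graph-level label for weak supervision."""
    subgraphs = []
    for seed in graph.nodes:
        reach = nx.single_source_shortest_path_length(graph, seed, cutoff=depth)
        nodes = list(reach)[:max_nodes]
        subgraphs.append(graph.subgraph(nodes).copy())
    return subgraphs

# Toy cell graph: nodes are nuclei, edges connect spatial neighbours.
g = nx.random_geometric_graph(60, radius=0.2, seed=0)
subs = bfs_subgraphs(g, depth=2)
print(len(subs), sum(s.number_of_nodes() for s in subs) / len(subs))
```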
Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction to identify anomalous events such as violence and nudity in videos using only video-level labels. However, this task poses substantial challenges, including addressing imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism known as the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) mechanism to effectively capture the hierarchical relationships between normal and abnormal representations, thereby enhancing feature separation accuracy. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets for violence and nudity detection.
https://arxiv.org/abs/2412.20455
Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse trainable images provide WSSS with richer information and help the model understand more comprehensive semantic patterns. In this paper, we therefore introduce a novel approach called the Image Augmentation Agent (IAA), which shows that WSSS can be enhanced from the data generation perspective. IAA designs an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability of prompt generation by LLMs, we develop a prompt self-refinement mechanism that allows LLMs to re-evaluate the rationality of generated prompts and produce more coherent prompts. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of the generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.
https://arxiv.org/abs/2412.20439
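A schematic of a prompt self-refinement loop of the kind described above, written against a generic text-in/text-out `generate` callable (a stand-in, not any specific LLM API); the prompt wording and number of refinement rounds are assumptions.

```python
def refine_prompts(generate, class_name, rounds=2, n_prompts=5):
    """Prompt self-refinement sketch: ask an LLM for scene prompts, then ask it
    to critique and rewrite its own prompts for coherence. `generate` is any
    text-in/text-out LLM callable."""
    prompts = generate(
        f"Write {n_prompts} short, concrete image-generation prompts that each "
        f"contain a '{class_name}', one per line."
    ).splitlines()
    for _ in range(rounds):
        critique = generate(
            "Rate each prompt below for coherence and visual plausibility, then "
            "rewrite any prompt that is vague or contradictory, one per line:\n"
            + "\n".join(prompts)
        )
        prompts = [p for p in critique.splitlines() if p.strip()][:n_prompts]
    return prompts

# Usage with any LLM client wrapped as `generate = lambda text: ...`
```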
Weakly Supervised Monitoring Anomaly Detection (WSMAD) uses weak supervision to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices. TCVADS operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and inputs them into a time series analysis module, which acts as the teacher model. Insights are then transferred via knowledge distillation to a simplified convolutional network (the student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.
https://arxiv.org/abs/2412.20201
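The first-stage teacher-student transfer can be summarized by a standard distillation objective. Below is a sketch assuming a temperature-softened KL term mixed with the ordinary classification loss on video-level labels, i.e., generic Hinton-style distillation rather than TCVADS's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Stage-one training signal for the lightweight student: soften teacher
    predictions with temperature T and mix the KL term with the ordinary
    classification loss on video-level labels."""
    soft_t = F.log_softmax(teacher_logits / T, dim=-1)
    soft_s = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_s, soft_t, log_target=True, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

loss = distillation_loss(torch.randn(8, 2), torch.randn(8, 2), torch.randint(0, 2, (8,)))
```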
The application of Contrastive Language-Image Pre-training (CLIP) to Weakly Supervised Semantic Segmentation (WSSS) has demonstrated powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between the text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework that introduces more representative vision prototypes. The core of this framework is to learn class-specific vision prototypes in the vision space, with the help of text prototypes, to capture high-quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state-of-the-art performance on two benchmark datasets.
https://arxiv.org/abs/2412.19650
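A compact sketch of a regional semantic contrast term consistent with the description above: each region embedding is pulled toward the vision prototype of its (pseudo-)class and pushed away from the others via an InfoNCE-style loss. The temperature and the use of cross-entropy over prototype similarities are assumptions.

```python
import torch
import torch.nn.functional as F

def regional_prototype_contrast(region_feats, region_labels, prototypes, tau=0.1):
    """InfoNCE-style regional contrast: each region embedding is attracted to
    the vision prototype of its (pseudo-)class and repelled from the others.

    region_feats: (R, D), region_labels: (R,) int64, prototypes: (C, D).
    """
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = r @ p.t() / tau                     # (R, C) similarities to prototypes
    return F.cross_entropy(logits, region_labels)

loss = regional_prototype_contrast(torch.randn(64, 256),
                                   torch.randint(0, 21, (64,)),
                                   torch.randn(21, 256))
```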