Weakly supervised violence detection refers to the technique of training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach, leveraging the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature matching subspace and aligns the modality-specific event representations based on this subspace, enabling full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, achieving an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at this https URL.
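A minimal sketch of the alignment idea described above, with an assumed linear projection into the RGB feature space and a top-k heuristic standing in for the iterative search for a non-zero feature matching subspace (module name, dimensions, and loss form are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseModalityAligner(nn.Module):
    def __init__(self, aux_dim: int, rgb_dim: int, k: int = 64):
        super().__init__()
        self.proj = nn.Linear(aux_dim, rgb_dim)   # map audio/flow features into RGB space
        self.k = k                                # size of the matching subspace

    def forward(self, aux_feat: torch.Tensor, rgb_feat: torch.Tensor):
        # aux_feat: (batch, aux_dim), rgb_feat: (batch, rgb_dim)
        mapped = self.proj(aux_feat)
        # Pick the k RGB dimensions where both modalities carry energy
        # (a stand-in for iteratively finding the non-zero matching subspace).
        joint = (mapped.abs() * rgb_feat.abs()).mean(dim=0)    # (rgb_dim,)
        idx = joint.topk(self.k).indices
        align_loss = 1.0 - F.cosine_similarity(mapped[:, idx], rgb_feat[:, idx], dim=1).mean()
        return mapped, align_loss

aligner = SparseModalityAligner(aux_dim=128, rgb_dim=1024, k=64)
audio, rgb = torch.randn(8, 128), torch.randn(8, 1024)
_, loss = aligner(audio, rgb)
loss.backward()
```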
https://arxiv.org/abs/2501.07496
ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services: Non-Standard Word (NSW) Lookup and Lexical Normalization, enabling users to retrieve standard forms of informal language and standardize text containing NSWs. ViSoLex's architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system's design, functionality, and its applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system's capabilities for additional languages and improving the handling of more complex non-standard linguistic patterns.
https://arxiv.org/abs/2501.07020
Weakly supervised segmentation has the potential to greatly reduce the annotation effort for training segmentation models for small structures such as hyper-reflective foci (HRF) in optical coherence tomography (OCT). However, most weakly supervised methods either involve a strong downsampling of input images, or only achieve localization at a coarse resolution, both of which are unsatisfactory for small structures. We propose a novel framework that increases the spatial resolution of a traditional attention-based Multiple Instance Learning (MIL) approach by using Layer-wise Relevance Propagation (LRP) to prompt the Segment Anything Model (SAM 2), and increases recall with iterative inference. Moreover, we demonstrate that replacing MIL with a Compact Convolutional Transformer (CCT), which adds a positional encoding and permits an exchange of information between different regions of the OCT image, leads to a further, substantial increase in segmentation accuracy.
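For context, a minimal sketch of the gated attention-based MIL pooling that the framework builds on (the LRP-to-SAM 2 prompting and the CCT variant are not reproduced here; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 128, n_classes: int = 1):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patches: torch.Tensor):
        # patches: (n_instances, feat_dim) -- embeddings of OCT image patches
        a = self.attn_w(self.attn_v(patches) * self.attn_u(patches))  # (n, 1)
        a = torch.softmax(a, dim=0)                                   # attention over instances
        bag = (a * patches).sum(dim=0, keepdim=True)                  # (1, feat_dim)
        return self.classifier(bag), a    # bag-level logit + per-patch attention map

logits, attention = GatedAttentionMIL()(torch.randn(196, 512))
```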
https://arxiv.org/abs/2501.05933
Salient Object Detection (SOD) aims to identify and segment prominent regions within a scene. Traditional models rely on manually annotated pseudo labels with precise pixel-level accuracy, which is time-consuming. We developed a low-cost, high-precision annotation method that leverages large foundation models to address these challenges. Specifically, we use a weakly supervised approach to guide large models in generating pseudo-labels through textual prompts. Since large models do not effectively focus on the salient regions of images, we manually annotate a subset of text to fine-tune the model. Based on this approach, which enables precise and rapid generation of pseudo-labels, we introduce a new dataset, BDS-TR. Compared to the previous DUTS-TR dataset, BDS-TR is substantially larger in scale and encompasses a wider variety of categories and scenes. This expansion will enhance our model's applicability across a broader range of scenarios and provide a more comprehensive foundational dataset for future SOD research. Additionally, we present an edge decoder based on dynamic upsampling, which focuses on object edges while gradually recovering image feature resolution. Comprehensive experiments on five benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches and also surpasses several existing fully-supervised SOD methods. The code and results will be made available.
https://arxiv.org/abs/2501.04582
Rotated object detection has made significant progress in optical remote sensing. However, advancements in the Synthetic Aperture Radar (SAR) field lag behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, existing weakly supervised models exhibit limited accuracy in predicting the object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and expose that they share the same shortcoming: these methods overlook the unit cycle constraint inherent in these encodings, easily leading to prediction biases. To address this issue, we propose the Unit Cycle Resolver (UCR), which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., the DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that our UCR enhances angle prediction accuracy. Our dataset and code can be found at: this https URL.
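A minimal sketch of a unit circle constraint on (cos, sin) angle encodings, in the spirit of the Unit Cycle Resolver; the exact loss form and weighting used in the paper may differ:

```python
import torch

def unit_circle_loss(pred_cos: torch.Tensor, pred_sin: torch.Tensor) -> torch.Tensor:
    # Penalize encodings that leave the unit circle: cos^2 + sin^2 should equal 1.
    return ((pred_cos ** 2 + pred_sin ** 2 - 1.0) ** 2).mean()

def decode_angle(pred_cos: torch.Tensor, pred_sin: torch.Tensor) -> torch.Tensor:
    # Recover the rotation angle from the (possibly unnormalized) encoding.
    return torch.atan2(pred_sin, pred_cos)

cos_hat, sin_hat = torch.tensor([0.9]), torch.tensor([0.5])
print(unit_circle_loss(cos_hat, sin_hat), decode_angle(cos_hat, sin_hat))
```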
https://arxiv.org/abs/2501.04440
With the rapid advancement of deep learning, computational pathology has made significant progress in cancer diagnosis and subtyping. Tissue segmentation is a core challenge, essential for prognosis and treatment decisions. Weakly supervised semantic segmentation (WSSS) reduces the annotation requirement by using image-level labels instead of pixel-level ones. However, Class Activation Map (CAM)-based methods still suffer from low spatial resolution and unclear boundaries. To address these issues, we propose a multi-level superpixel correction algorithm that refines CAM boundaries using superpixel clustering and flood fill. Experimental results show that our method achieves strong performance on a breast cancer segmentation dataset, with an mIoU of 71.08%, significantly improving tumor microenvironment boundary delineation.
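A hedged sketch of the general idea, refining an upsampled CAM with superpixel averaging and a flood fill from the activation peak (library calls are standard scikit-image; thresholds and parameters are assumptions, not the paper's values):

```python
import numpy as np
from skimage.segmentation import slic, flood
from skimage.transform import resize

def refine_cam(image: np.ndarray, cam: np.ndarray, n_segments: int = 400, thr: float = 0.5):
    # image: (H, W, 3) float in [0, 1]; cam: low-resolution activation map
    cam = resize(cam, image.shape[:2], order=1)            # upsample CAM to image size
    segments = slic(image, n_segments=n_segments, compactness=10.0, start_label=0)
    sp_cam = np.zeros_like(cam)
    for s in np.unique(segments):
        sp_cam[segments == s] = cam[segments == s].mean()  # superpixel-averaged CAM
    mask = sp_cam > thr
    seed = np.unravel_index(np.argmax(sp_cam), sp_cam.shape)
    grown = flood(sp_cam, seed, tolerance=0.2)             # flood fill around the peak
    return mask | grown

refined = refine_cam(np.random.rand(256, 256, 3), np.random.rand(32, 32))
```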
https://arxiv.org/abs/2501.03891
Graph classification plays a pivotal role in various domains, including pathology. In this domain, images can be represented as graphs, where nodes might represent individual nuclei, and edges capture the spatial or functional relationships between them. Often, the overall label of the graph, such as a cancer type or disease state, is determined by patterns within smaller, localized regions of the image. This work introduces a weakly-supervised graph classification framework leveraging two subgraph extraction techniques: (1) a sliding-window approach and (2) a BFS-based approach. Subgraphs are processed using a Graph Attention Network (GAT), which employs attention mechanisms to identify the most informative subgraphs for classification. Weak supervision is achieved by propagating graph-level labels to subgraphs, eliminating the need for detailed subgraph annotations.
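A minimal sketch of the BFS-based subgraph extraction with weak supervision, where every subgraph inherits the graph-level label (the size limit and seed policy are illustrative assumptions):

```python
from collections import deque

def bfs_subgraph(adj: dict, seed, max_nodes: int = 32):
    """adj maps a node id to a list of neighbour ids (e.g. nuclei in a tissue graph)."""
    visited, queue = {seed}, deque([seed])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in visited and len(visited) < max_nodes:
                visited.add(nb)
                queue.append(nb)
    return visited

def extract_subgraphs(adj: dict, graph_label, max_nodes: int = 32):
    # Weak supervision: every subgraph is paired with the whole-graph label.
    return [(bfs_subgraph(adj, seed, max_nodes), graph_label) for seed in adj]

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(extract_subgraphs(adj, graph_label="tumor", max_nodes=3))
```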
https://arxiv.org/abs/2501.02021
Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction to identify anomaly events like violence and nudity in videos using only video-level labels. However, this task has substantial challenges, including addressing imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism known as the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby enhancing feature separation accuracy. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.
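A hedged sketch of the gating idea behind a cross-modal fusion adapter, selecting audio features by their relevance to the visual stream (the actual CFA design and the hyperbolic Lorentzian graph attention are more involved; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FusionAdapter(nn.Module):
    def __init__(self, vis_dim: int = 512, aud_dim: int = 128):
        super().__init__()
        self.audio_proj = nn.Linear(aud_dim, vis_dim)
        self.gate = nn.Sequential(nn.Linear(2 * vis_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (batch, T, vis_dim) snippet features; aud: (batch, T, aud_dim)
        aud = self.audio_proj(aud)
        g = self.gate(torch.cat([vis, aud], dim=-1))  # per-snippet relevance of audio
        return vis + g * aud                          # enhance visual features selectively

fused = FusionAdapter()(torch.randn(2, 32, 512), torch.randn(2, 32, 128))
```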
https://arxiv.org/abs/2412.20455
Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse training images provide WSSS with richer information and help the model understand more comprehensive semantic patterns. Therefore, in this paper, we introduce a novel approach called Image Augmentation Agent (IAA), which shows that it is possible to enhance WSSS from the data generation perspective. IAA mainly designs an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability of prompt generation by LLMs, we develop a prompt self-refinement mechanism. It allows LLMs to re-evaluate the rationality of generated prompts to produce more coherent prompts. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of the generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.
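A minimal sketch of the prompt self-refinement loop: ask an LLM for an image prompt, then have it re-evaluate and rewrite its own prompt before passing it to a diffusion model. `query_llm` and `generate_image` are hypothetical stand-ins, not real APIs:

```python
def query_llm(instruction: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_image(prompt: str):
    raise NotImplementedError("plug in your diffusion pipeline here")

def self_refined_prompt(category: str, rounds: int = 2) -> str:
    # Initial prompt, then a fixed number of self-evaluation passes.
    prompt = query_llm(f"Write a concise photo description containing a {category}.")
    for _ in range(rounds):
        prompt = query_llm(
            f"Is this prompt coherent and faithful to the class '{category}'?\n"
            f"Prompt: {prompt}\nIf not, rewrite it; otherwise repeat it unchanged."
        )
    return prompt

# image = generate_image(self_refined_prompt("bicycle"))
```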
https://arxiv.org/abs/2412.20439
Weakly Supervised Monitoring Anomaly Detection (WSMAD) utilizes weak supervision learning to identify anomalies, a critical task for smart city monitoring. However, existing multimodal approaches often fail to meet the real-time and interpretability requirements of edge devices due to their complexity. This paper presents TCVADS (Two-stage Cross-modal Video Anomaly Detection System), which leverages knowledge distillation and cross-modal contrastive learning to enable efficient, accurate, and interpretable anomaly detection on edge devices. TCVADS operates in two stages: coarse-grained rapid classification and fine-grained detailed analysis. In the first stage, TCVADS extracts features from video frames and inputs them into a time series analysis module, which acts as the teacher model. Insights are then transferred via knowledge distillation to a simplified convolutional network (student model) for binary classification. Upon detecting an anomaly, the second stage is triggered, employing a fine-grained multi-class classification model. This stage uses CLIP for cross-modal contrastive learning with text and images, enhancing interpretability and achieving refined classification through specially designed triplet textual relationships. Experimental results demonstrate that TCVADS significantly outperforms existing methods in model performance, detection efficiency, and interpretability, offering valuable contributions to smart city monitoring applications.
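A minimal sketch of the teacher-to-student transfer in the coarse stage, using standard temperature-scaled knowledge distillation for the binary anomaly classifier (architectures and weights are placeholders):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # Soft targets from the time-series teacher, hard targets from the video-level labels.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(4, 2), torch.randn(4, 2), torch.tensor([0, 1, 0, 0]))
```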
https://arxiv.org/abs/2412.20201
The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between the text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework that introduces more representative vision prototypes. The core of this framework is to learn class-specific vision prototypes in the vision space with the help of text prototypes, in order to capture high-quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state-of-the-art performance on two benchmark datasets.
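A hedged, InfoNCE-style sketch of contrasting region embeddings with class-specific vision prototypes; prototype construction and the exact contrast formulation in the paper may differ:

```python
import torch
import torch.nn.functional as F

def regional_prototype_contrast(regions, region_labels, prototypes, tau: float = 0.1):
    # regions: (N, D) region embeddings; prototypes: (C, D); region_labels: (N,)
    regions = F.normalize(regions, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = regions @ prototypes.t() / tau        # similarity to every class prototype
    return F.cross_entropy(logits, region_labels)  # pull regions to their own prototype

loss = regional_prototype_contrast(
    torch.randn(16, 256), torch.randint(0, 20, (16,)), torch.randn(20, 256)
)
```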
https://arxiv.org/abs/2412.19650
Weakly supervised temporal action localization (WS-TAL) aims to localize complete action instances and categorize them using only video-level labels. Action-background ambiguity, primarily caused by background noise resulting from aggregation and by intra-action variation, is a significant challenge for existing WS-TAL methods. In this paper, we introduce a hybrid multi-head attention (HMHA) module and a generalized uncertainty-based evidential fusion (GUEF) module to address the problem. The proposed HMHA effectively enhances RGB and optical flow features by filtering redundant information and adjusting their feature distribution to better align with the WS-TAL task. Additionally, the proposed GUEF adaptively eliminates the interference of background noise by fusing snippet-level evidence to refine the uncertainty measurement and select superior foreground feature information, which enables the model to concentrate on complete action instances and achieve better action localization and classification performance. Experimental results on the THUMOS14 dataset demonstrate that our method outperforms state-of-the-art methods. Our code is available at \url{this https URL}.
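A minimal sketch of snippet-level evidential uncertainty in the subjective-logic sense (evidence to Dirichlet parameters to per-snippet uncertainty), which is the kind of quantity the GUEF module refines; the fusion rule itself is not reproduced here:

```python
import torch
import torch.nn.functional as F

def snippet_uncertainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: (T, C) per-snippet class logits; evidence must be non-negative
    evidence = F.softplus(logits)
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    K = torch.tensor(logits.shape[-1], dtype=logits.dtype)
    return K / alpha.sum(dim=-1)               # uncertainty mass u = K / S, in (0, 1]

u = snippet_uncertainty(torch.randn(100, 20))
foreground = u < u.median()                    # keep the more confident snippets
```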
https://arxiv.org/abs/2412.19418
The idea of disentangled representations is to reduce the data to a set of generative factors that produce it. Typically, such representations are vectors in latent space, where each coordinate corresponds to one of the generative factors. The object can then be modified by changing the value of a particular coordinate, but it is necessary to determine which coordinate corresponds to the desired generative factor -- a difficult task if the vector representation has a high dimension. In this article, we propose ArSyD (Architecture for Symbolic Disentanglement), which represents each generative factor as a vector of the same dimension as the resulting representation. In ArSyD, the object representation is obtained as a superposition of the generative factor vector representations. We call such a representation a \textit{symbolic disentangled representation}. We use the principles of Hyperdimensional Computing (also known as Vector Symbolic Architectures), where symbols are represented as hypervectors, allowing vector operations on them. Disentanglement is achieved by construction: no additional assumptions about the underlying distributions are made during training, and the model is only trained to reconstruct images in a weakly supervised manner. We study ArSyD on the dSprites and CLEVR datasets and provide a comprehensive analysis of the learned symbolic disentangled representations. We also propose new disentanglement metrics that allow comparison of methods that use latent representations of different dimensions. ArSyD allows object properties to be edited in a controlled and interpretable way, and the dimensionality of the object property representation coincides with the dimensionality of the object representation itself.
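A minimal sketch of the Hyperdimensional Computing operations behind such a symbolic disentangled representation: bind each generative-factor key to a value hypervector and superpose the results into one object vector (random bipolar hypervectors for illustration; ArSyD's learned encoders are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                     # hypervector dimensionality

def hv() -> np.ndarray:
    return rng.choice([-1.0, 1.0], size=D)     # random bipolar hypervector

keys = {"shape": hv(), "color": hv()}          # generative-factor roles
values = {"square": hv(), "circle": hv(), "red": hv(), "blue": hv()}

def encode(factors: dict) -> np.ndarray:
    # Bind each factor key to its value (elementwise product) and superpose (sum).
    return sum(keys[k] * values[v] for k, v in factors.items())

def decode(obj: np.ndarray, factor: str, candidates) -> str:
    # Unbind the factor (binding is self-inverse for bipolar vectors) and pick
    # the closest candidate value by dot-product similarity.
    unbound = obj * keys[factor]
    return max(candidates, key=lambda c: unbound @ values[c])

obj = encode({"shape": "circle", "color": "red"})
print(decode(obj, "shape", ["square", "circle"]))   # -> "circle" (with high probability)
```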
https://arxiv.org/abs/2412.19847
Creating fully annotated labels for medical image segmentation is prohibitively time-intensive and costly, emphasizing the necessity for innovative approaches that minimize reliance on detailed annotations. Scribble annotations offer a more cost-effective alternative, significantly reducing the expenses associated with full annotations. However, scribble annotations offer limited and imprecise information, failing to capture the detailed structural and boundary characteristics necessary for accurate organ delineation. To address these challenges, we propose HELPNet, a novel scribble-based weakly supervised segmentation framework, designed to bridge the gap between annotation efficiency and segmentation performance. HELPNet integrates three modules. The hierarchical perturbation consistency (HPC) module enhances feature learning by employing density-controlled jigsaw perturbations across global, local, and focal views, enabling robust modeling of multi-scale structural representations. Building on this, the entropy-guided pseudo-label (EGPL) module evaluates the confidence of segmentation predictions using entropy, generating high-quality pseudo-labels. Finally, the structural prior refinement (SPR) module incorporates connectivity and bounded priors to enhance the precision and reliability of the pseudo-labels. Experimental results on three public datasets (ACDC, MSCMRseg, and CHAOS) show that HELPNet significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation and achieves performance comparable to fully supervised methods. The code is available at this https URL.
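A minimal sketch of entropy-guided pseudo-labelling: keep only pixels whose prediction entropy is below a threshold and mark the rest with an ignore index (threshold and ignore value are assumptions):

```python
import torch
import torch.nn.functional as F

def entropy_pseudo_labels(logits: torch.Tensor, max_entropy: float = 0.5):
    # logits: (B, C, H, W) segmentation predictions
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)   # (B, H, W)
    pseudo = probs.argmax(dim=1)                                  # (B, H, W)
    pseudo[entropy > max_entropy] = 255                           # ignore index for uncertain pixels
    return pseudo

pseudo = entropy_pseudo_labels(torch.randn(2, 4, 64, 64))
```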
https://arxiv.org/abs/2412.18738
Improving global school connectivity is critical for ensuring inclusive and equitable quality education. To reliably estimate the cost of connecting schools, governments and connectivity providers require complete and accurate school location data - a resource that is often scarce in many low- and middle-income countries. To address this challenge, we propose a cost-effective, scalable approach to locating schools in high-resolution satellite images using weakly supervised deep learning techniques. Our best models, which combine vision transformers and convolutional neural networks, achieve AUPRC values above 0.96 across 10 pilot African countries. Leveraging explainable AI techniques, our approach can approximate the precise geographical coordinates of the school locations using only low-cost, classification-level annotations. To demonstrate the scalability of our method, we generate nationwide maps of school location predictions in African countries and present a detailed analysis of our results, using Senegal as our case study. Finally, we demonstrate the immediate usability of our work by introducing an interactive web mapping tool to streamline human-in-the-loop model validation efforts by government partners. This work successfully showcases the real-world utility of deep learning and satellite images for planning regional infrastructure and accelerating universal school connectivity.
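A hedged sketch of how an explanation map can be turned into approximate coordinates: take the peak of a class activation map over a satellite tile and interpolate within the tile's known bounding box (the CAM computation is omitted; the example bounds are illustrative, not actual school locations):

```python
import numpy as np

def cam_peak_to_latlon(cam: np.ndarray, tile_bounds: tuple) -> tuple:
    # cam: (H, W) activation map for the "school" class over one tile
    # tile_bounds: (lat_north, lat_south, lon_west, lon_east) of the tile
    h, w = cam.shape
    row, col = np.unravel_index(np.argmax(cam), cam.shape)
    lat_n, lat_s, lon_w, lon_e = tile_bounds
    lat = lat_n + (row + 0.5) / h * (lat_s - lat_n)
    lon = lon_w + (col + 0.5) / w * (lon_e - lon_w)
    return lat, lon

print(cam_peak_to_latlon(np.random.rand(64, 64), (14.70, 14.69, -17.45, -17.44)))
```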
https://arxiv.org/abs/2412.14870
3D occupancy perception is gaining increasing attention due to its capability to offer detailed and precise environment representations. Previous weakly-supervised NeRF methods balance efficiency and accuracy, with mIoU varying by 5-10 points depending on the sampling count along camera rays. Recently, real-time Gaussian splatting has gained widespread popularity in 3D reconstruction, and the occupancy prediction task can also be viewed as a reconstruction task. Consequently, we propose GSRender, which naturally employs 3D Gaussian splatting for occupancy prediction, simplifying the sampling process. In addition, the limitations of 2D supervision result in duplicate predictions along the same camera ray. We implement a Ray Compensation (RC) module, which mitigates this issue by compensating for features from adjacent frames. Finally, we redesign the loss to eliminate the impact of dynamic objects from adjacent frames. Extensive experiments demonstrate that our approach achieves SOTA (state-of-the-art) results in RayIoU (+6.0), while narrowing the gap with 3D-supervised methods. Our code will be released soon.
https://arxiv.org/abs/2412.14579
Partial label learning (PLL) is a complicated weakly supervised multi-class classification task, further compounded by class imbalance. Currently, existing methods rely only on inter-class pseudo-labeling from inter-class features, often overlooking the significant impact of the intra-class imbalanced features combined with the inter-class ones. To address these limitations, we introduce Granular Ball Representation for Imbalanced PLL (GBRIP), a novel framework for imbalanced PLL. GBRIP utilizes coarse-grained granular ball representation and a multi-center loss to construct a granular ball-based feature space through unsupervised learning, effectively capturing the feature distribution within each class. GBRIP mitigates the impact of confusing features by systematically refining label disambiguation and estimating imbalance distributions. The novel multi-center loss function enhances learning by emphasizing the relationships between samples and their respective centers within the granular balls. Extensive experiments on standard benchmarks demonstrate that GBRIP outperforms existing state-of-the-art methods, offering a robust solution to the challenges of imbalanced PLL.
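A hedged sketch of a multi-center loss, where each class keeps several centers and a sample is pulled towards the nearest center of its own class (the granular-ball construction itself is not reproduced; the number of centers is an assumption):

```python
import torch
import torch.nn as nn

class MultiCenterLoss(nn.Module):
    def __init__(self, n_classes: int, n_centers: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, n_centers, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats: (B, D); labels: (B,)
        own = self.centers[labels]                          # (B, n_centers, D)
        dists = (feats.unsqueeze(1) - own).pow(2).sum(-1)   # (B, n_centers)
        return dists.min(dim=1).values.mean()               # distance to nearest own center

loss = MultiCenterLoss(10, 3, 64)(torch.randn(8, 64), torch.randint(0, 10, (8,)))
```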
https://arxiv.org/abs/2412.14561
Weakly Supervised Semantic Segmentation (WSSS), which leverages image-level labels, has garnered significant attention due to its cost-effectiveness. Previous methods mainly strengthen inter-class differences to avoid class semantic ambiguity, which may lead to erroneous activation. However, they overlook the positive role of some shared information between similar classes. Categories within the same cluster share some similar features; allowing the model to recognize these features can further relieve the semantic ambiguity between these classes. To effectively identify and utilize this shared information, in this paper we introduce a novel WSSS framework called Prompt Categories Clustering (PCC). Specifically, we explore the ability of Large Language Models (LLMs) to derive category clusters through prompts. These clusters effectively represent the intrinsic relationships between categories. By integrating this relational information into the training network, our model is able to better learn the hidden connections between categories. Experimental results demonstrate the effectiveness of our approach, showing its ability to enhance performance on the PASCAL VOC 2012 dataset and surpass existing state-of-the-art methods in WSSS.
https://arxiv.org/abs/2412.13823
Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level, using these concepts to enhance video features and provide temporal event cues; and (2) design cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator's event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
https://arxiv.org/abs/2412.11467
Classifying large images with small or tiny regions of interest (ROI) is challenging due to computational and memory constraints. Weakly supervised memory-efficient patch selectors have achieved results comparable with strongly supervised methods. However, low signal-to-noise ratios and low-entropy attention still cause overfitting. We explore these issues using a novel testbed on a memory-efficient cross-attention transformer with Iterative Patch Selection (IPS) as the patch selection module. Our testbed extends the megapixel MNIST benchmark to four smaller O2I (object-to-image) ratios ranging from 0.01% to 0.14% while keeping the canvas size fixed, and introduces a noise generation component based on Bézier curves. Experimental results generalize the observations made on CNNs to IPS, whereby the O2I threshold below which the classifier fails to generalize is affected by the training dataset size. We further observe that the magnitude of this interaction differs for each task of the megapixel MNIST: for tasks "Maj" and "Top" the rate is at its highest, followed by tasks "Max" and "Multi", where in the latter this rate is almost 0. Moreover, results show that in a low-data setting, tuning the patch size to be smaller relative to the ROI improves generalization, resulting in an improvement of +15% for the megapixel MNIST and +5% for the Swedish traffic signs dataset compared to the original object-to-patch ratios in IPS. Further outcomes indicate that the similarity between the thickness of the noise component and the digits in the megapixel MNIST gradually causes IPS to fail to generalize, contributing to previous suspicions.
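A minimal sketch of Bézier-based noise generation: rasterize a random cubic Bézier curve with a given stroke thickness onto the canvas as a digit-like distractor (control-point ranges and thickness handling are assumptions, not the testbed's exact generator):

```python
import numpy as np

def bezier_noise(canvas_size: int = 512, thickness: float = 2.0, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    p = rng.uniform(0, canvas_size, size=(4, 2))             # 4 random control points
    t = np.linspace(0, 1, 2000)[:, None]
    curve = ((1 - t) ** 3 * p[0] + 3 * (1 - t) ** 2 * t * p[1]
             + 3 * (1 - t) * t ** 2 * p[2] + t ** 3 * p[3])   # points on the cubic Bezier curve
    canvas = np.zeros((canvas_size, canvas_size), dtype=np.float32)
    yy, xx = np.mgrid[0:canvas_size, 0:canvas_size]
    for cx, cy in curve[::50]:                                # subsample points, stamp small discs
        canvas[(xx - cx) ** 2 + (yy - cy) ** 2 <= thickness ** 2] = 1.0
    return canvas

noise = bezier_noise()
```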
https://arxiv.org/abs/2412.11237