Automatic speech recognition (ASR) is crucial for human-machine interaction in diverse applications like conversational agents, industrial robotics, call center automation, and automated subtitling. However, developing high-performance ASR models remains challenging, particularly for low-resource languages like Arabic, due to the scarcity of large, labeled speech datasets, which are costly and labor-intensive to produce. In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA), eliminating the need for costly manual transcriptions. Despite the absence of human-verified labels, our approach attains state-of-the-art (SOTA) performance, exceeding all previous efforts in the field of Arabic ASR on the standard benchmarks. By demonstrating the effectiveness of weak supervision as a scalable, cost-efficient alternative to traditional supervised approaches, this work paves the way for improved ASR systems in low-resource settings.
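For illustration, here is a minimal sketch of a Conformer acoustic model with a CTC head of the kind such training typically uses, built on torchaudio's Conformer as a stand-in; the layer sizes, vocabulary, and loss setup are assumptions, not the paper's 15,000-hour configuration.

```python
# A minimal sketch (not the authors' code) of a Conformer + CTC acoustic model.
import torch
import torch.nn as nn
import torchaudio

class ConformerCTC(nn.Module):
    def __init__(self, n_mels=80, vocab_size=64):
        super().__init__()
        self.encoder = torchaudio.models.Conformer(
            input_dim=n_mels, num_heads=4, ffn_dim=512,
            num_layers=8, depthwise_conv_kernel_size=31,
        )
        self.head = nn.Linear(n_mels, vocab_size)  # project to token logits

    def forward(self, feats, lengths):
        # feats: (batch, time, n_mels) log-mel features; lengths: (batch,)
        enc, enc_lengths = self.encoder(feats, lengths)
        return self.head(enc).log_softmax(dim=-1), enc_lengths

model = ConformerCTC()
feats = torch.randn(2, 200, 80)
lengths = torch.tensor([200, 180])
log_probs, out_lens = model(feats, lengths)

# CTC loss against (possibly noisy) weak transcripts
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 64, (2, 30))
loss = ctc(log_probs.transpose(0, 1), targets, out_lens, torch.tensor([30, 30]))
```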
https://arxiv.org/abs/2504.12254
The emerging trend in computer vision emphasizes developing universal models capable of simultaneously addressing multiple diverse tasks. Such universality typically requires joint training across multi-domain datasets to ensure effective generalization. However, monocular 3D object detection presents unique challenges in multi-domain training due to the scarcity of datasets annotated with accurate 3D ground-truth labels, especially beyond typical road-based autonomous driving contexts. To address this challenge, we introduce a novel weakly supervised framework leveraging pseudo-labels. Current pretrained models often struggle to accurately detect pedestrians in non-road environments due to inherent dataset biases. Unlike generalized image-based 2D object detection models, achieving similar generalization in monocular 3D detection remains largely unexplored. In this paper, we propose GATE3D, a novel framework designed specifically for generalized monocular 3D object detection via weak supervision. GATE3D effectively bridges domain gaps by employing consistency losses between 2D and 3D predictions. Remarkably, our model achieves competitive performance on the KITTI benchmark as well as on an indoor-office dataset collected by us to evaluate the generalization capabilities of our framework. Our results demonstrate that GATE3D significantly accelerates learning from limited annotated data through effective pre-training strategies, highlighting substantial potential for broader impacts in robotics, augmented reality, and virtual reality applications. Project page: this https URL
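As a sketch of what a 2D-3D consistency term can look like (the paper's exact losses may differ), the snippet below projects a predicted 3D box center through the camera intrinsics and penalizes its distance to the predicted 2D box center; all tensors and the intrinsics matrix are illustrative.

```python
# Hedged sketch of a 2D-3D consistency loss in the spirit of GATE3D.
import torch

def consistency_loss(center_3d, box_2d, K):
    """center_3d: (N, 3) camera-frame XYZ; box_2d: (N, 4) x1,y1,x2,y2; K: (3, 3)."""
    proj = (K @ center_3d.T).T                      # homogeneous image points
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # perspective divide
    center_2d = 0.5 * (box_2d[:, :2] + box_2d[:, 2:])
    return torch.nn.functional.smooth_l1_loss(uv, center_2d)

K = torch.tensor([[721.5, 0.0, 609.6],   # KITTI-like intrinsics, illustrative
                  [0.0, 721.5, 172.9],
                  [0.0, 0.0, 1.0]])
loss = consistency_loss(torch.tensor([[1.0, 1.5, 10.0]]),
                        torch.tensor([[600.0, 150.0, 700.0, 300.0]]), K)
```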
https://arxiv.org/abs/2504.11014
Knowledge discovery is hindered by the increasing volume of publications and the scarcity of extensive annotated data. To tackle the challenge of information overload, it is essential to employ automated methods for knowledge extraction and processing. Finding the right balance between the level of supervision and the effectiveness of models poses a significant challenge. While supervised techniques generally result in better performance, they have the major drawback of demanding labeled data. This requirement is labor-intensive and time-consuming and hinders scalability when exploring new domains. In this context, our study addresses the challenge of identifying semantic relationships between biomedical entities (e.g., diseases, proteins) in unstructured text while minimizing dependency on supervision. We introduce a suite of unsupervised algorithms based on dependency trees and attention mechanisms and employ a range of pointwise binary classification methods. Transitioning from weakly supervised to fully unsupervised settings, we assess the methods' ability to learn from data with noisy labels. The evaluation on biomedical benchmark datasets explores the effectiveness of the methods. Our approach tackles a central issue in knowledge discovery: balancing performance with minimal supervision. By gradually decreasing supervision, we assess the robustness of pointwise binary classification techniques in handling noisy labels, revealing their capability to shift from weakly supervised to entirely unsupervised scenarios. Comprehensive benchmarking offers insights into the effectiveness of these techniques, suggesting an encouraging direction toward adaptable knowledge discovery systems, representing progress in creating data-efficient methodologies for extracting useful insights when annotated data is limited.
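To make the dependency-tree signal concrete, here is a hedged toy example that extracts the shortest dependency path between two entities with spaCy and networkx; the sentence and entity indices are hypothetical, and the method suite in the paper goes well beyond this.

```python
# Toy shortest-dependency-path extraction between two biomedical entities.
# Requires: pip install spacy networkx && python -m spacy download en_core_web_sm
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc = nlp("BRCA1 mutations substantially increase the risk of breast cancer.")

# Build an undirected graph over dependency edges.
graph = nx.Graph([(tok.i, child.i) for tok in doc for child in tok.children])

head, tail = 0, doc[-2].i  # "BRCA1" and "cancer" (hard-coded toy indices)
path = nx.shortest_path(graph, source=head, target=tail)
print([doc[i].text for i in path])  # candidate relation evidence between entities
```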
https://arxiv.org/abs/2504.09582
DNA methylation is an epigenetic mechanism that regulates gene expression by adding methyl groups to DNA. Abnormal methylation patterns can disrupt gene expression and have been linked to cancer development. To quantify DNA methylation, specialized assays are typically used. However, these assays are often costly and have lengthy processing times, which limits their widespread availability in routine clinical practice. In contrast, whole slide images (WSIs) for the majority of cancer patients can be more readily available. As such, given the ready availability of WSIs, there is a compelling need to explore the potential relationship between WSIs and DNA methylation patterns. To address this, we propose an end-to-end graph neural network-based weakly supervised learning framework to predict the methylation state of gene groups exhibiting coherent patterns across samples. Using data from three cohorts from The Cancer Genome Atlas (TCGA) - TCGA-LGG (Brain Lower Grade Glioma), TCGA-GBM (Glioblastoma Multiforme) (n = 729) and TCGA-KIRC (Kidney Renal Clear Cell Carcinoma) (n = 511) - we demonstrate that the proposed approach achieves significantly higher AUROC scores than the state-of-the-art (SOTA) methods, by more than 20%. We conduct gene set enrichment analyses on the gene groups and show that the majority of the gene groups are significantly enriched in important hallmarks and pathways. We also generate spatially enriched heatmaps to further investigate links between histological patterns and DNA methylation states. To the best of our knowledge, this is the first study that explores the association of spatially resolved histological patterns with gene group methylation states across multiple cancer types using weakly supervised deep learning.
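The sketch below illustrates the general shape of such a pipeline under stated assumptions (not the authors' architecture): a two-layer graph network over patch embeddings with slide-level mean pooling and a sigmoid head giving per-gene-group methylation probabilities.

```python
# Minimal weakly supervised slide-level GNN sketch; dimensions are illustrative.
import torch
import torch.nn as nn

class SlideGNN(nn.Module):
    def __init__(self, in_dim=512, hidden=256, n_gene_groups=10):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, n_gene_groups)

    def forward(self, x, adj):
        # x: (n_patches, in_dim) patch embeddings; adj: row-normalized (n, n)
        h = torch.relu(self.lin1(adj @ x))   # aggregate neighboring patch features
        h = torch.relu(self.lin2(adj @ h))
        slide = h.mean(dim=0)                # weak supervision: one label per slide
        return torch.sigmoid(self.head(slide))  # per-group methylation probability

n = 6
adj = torch.ones(n, n) / n                   # toy fully-connected patch graph
probs = SlideGNN()(torch.randn(n, 512), adj)
```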
https://arxiv.org/abs/2504.05403
With the increasing adoption of video anomaly detection in intelligent surveillance domains, conventional visual-based detection approaches often struggle with information insufficiency and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion that enables adaptive cross-modal integration through lightweight parametric adaptation while maintaining the frozen CLIP backbone, and a novel audio-visual prompt that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP's generalization for the video anomaly detection task. Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during the distillation process. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, with unimodal data enhanced by uncertainty-driven distillation, our approach consistently outperforms current unimodal VAD methods.
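A minimal sketch of the frozen-backbone adapter idea, assuming precomputed CLIP-style visual and audio embeddings; the bottleneck adapters and gated fusion below are illustrative choices, not the paper's exact design.

```python
# Lightweight audio-visual fusion over a frozen backbone: trainable bottleneck
# adapters plus a learned gate, while CLIP embeddings themselves stay fixed.
import torch
import torch.nn as nn

class AVAdapterFusion(nn.Module):
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.v_adapt = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                     nn.Linear(bottleneck, dim))
        self.a_adapt = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                     nn.Linear(bottleneck, dim))
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, v, a):
        # v, a: (batch, frames, dim) frozen visual / audio embeddings
        v, a = v + self.v_adapt(v), a + self.a_adapt(a)   # residual adaptation
        g = torch.sigmoid(self.gate(torch.cat([v, a], dim=-1)))
        return g * v + (1 - g) * a                        # adaptive cross-modal merge

fused = AVAdapterFusion()(torch.randn(2, 16, 512), torch.randn(2, 16, 512))
```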
https://arxiv.org/abs/2504.04495
This paper presents a comprehensive evaluation framework for image segmentation algorithms, encompassing naive methods, machine learning approaches, and deep learning techniques. We begin by introducing the fundamental concepts and importance of image segmentation, and the role of interactive segmentation in enhancing accuracy. A detailed background theory section explores various segmentation methods, including thresholding, edge detection, region growing, feature extraction, random forests, support vector machines, convolutional neural networks, U-Net, and Mask R-CNN. The implementation and experimental setup are thoroughly described, highlighting three primary approaches: algorithm assisting user, user assisting algorithm, and hybrid methods. Evaluation metrics such as Intersection over Union (IoU), computation time, and user interaction time are employed to measure performance. A comparative analysis presents detailed results, emphasizing the strengths, limitations, and trade-offs of each method. The paper concludes with insights into the practical applicability of these approaches across various scenarios and outlines future work, focusing on expanding datasets, developing more representative approaches, integrating real-time feedback, and exploring weakly supervised and self-supervised learning paradigms to enhance segmentation accuracy and efficiency.
Keywords: Image Segmentation, Interactive Segmentation, Machine Learning, Deep Learning, Computer Vision
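For reference, the IoU metric used in the evaluation reduces to a few lines for binary segmentation masks:

```python
# Intersection over Union for binary masks.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: define IoU as perfect agreement
    return np.logical_and(pred, gt).sum() / union

pred = np.zeros((8, 8)); pred[2:6, 2:6] = 1
gt = np.zeros((8, 8)); gt[3:7, 3:7] = 1
print(iou(pred, gt))  # 9 / 23 ≈ 0.391
```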
https://arxiv.org/abs/2504.04435
In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions for video action detection. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work.
https://arxiv.org/abs/2504.03096
Accurate segmentation of polyps and skin lesions is essential for diagnosing colorectal and skin cancers. While various segmentation methods for polyps and skin lesions using fully supervised deep learning techniques have been developed, the pixel-level annotation of medical images by doctors is both time-consuming and costly. Foundational vision models like the Segment Anything Model (SAM) have demonstrated superior performance; however, directly applying SAM to medical segmentation may not yield satisfactory results due to the lack of domain-specific medical knowledge. In this paper, we propose BiSeg-SAM, a SAM-guided weakly supervised prompting and boundary refinement network for the segmentation of polyps and skin lesions. Specifically, we fine-tune SAM combined with a CNN module to learn local features. We introduce a WeakBox with two functions: automatically generating box prompts for the SAM model and using our proposed Multi-choice Mask-to-Box (MM2B) transformation for rough mask-to-box conversion, addressing the mismatch between coarse labels and precise predictions. Additionally, we apply scale consistency (SC) loss for prediction scale alignment. Our DetailRefine module enhances boundary precision and segmentation accuracy by refining coarse predictions using a limited amount of ground truth labels. This comprehensive approach enables BiSeg-SAM to achieve excellent multi-task segmentation performance. Our method demonstrates significant superiority over state-of-the-art (SOTA) methods when tested on five polyp datasets and one skin cancer dataset.
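As an illustration of the mask-to-box idea behind WeakBox (the paper's MM2B generates multiple candidate boxes; this sketch covers only the basic tight-box case with an assumed margin):

```python
# Derive a SAM-style box prompt from a coarse binary mask.
import numpy as np

def mask_to_box(mask: np.ndarray, margin: int = 2):
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # empty mask, no box prompt
    h, w = mask.shape
    x1, x2 = max(xs.min() - margin, 0), min(xs.max() + margin, w - 1)
    y1, y2 = max(ys.min() - margin, 0), min(ys.max() + margin, h - 1)
    return int(x1), int(y1), int(x2), int(y2)  # xyxy box prompt

mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 10:30] = 1
print(mask_to_box(mask))  # (8, 18, 31, 41)
```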
https://arxiv.org/abs/2504.01452
Weakly supervised object localization (WSOL) methods allow training models to classify images and localize ROIs. WSOL only requires low-cost image-class annotations yet provides a visually interpretable classifier, which is important in histology image analysis. Standard WSOL methods rely on class activation mapping (CAM) methods to produce spatial localization maps according to a single- or two-step strategy. While both strategies have made significant progress, they still face several limitations with histology images. Single-step methods can easily result in under- or over-activation due to the limited visual ROI saliency in histology images and the limited localization cues. They also face the well-known issue of asynchronous convergence between classification and localization tasks. The two-step approach is sub-optimal because it is tied to a frozen classifier, limiting the capacity for localization. Moreover, these methods also struggle when applied to out-of-distribution (OOD) datasets. In this paper, a multi-task approach for WSOL is introduced for simultaneous training of both tasks to address the asynchronous convergence problem. In particular, localization is performed in the pixel-feature space of an image encoder that is shared with classification. This allows learning discriminant features and accurate delineation of foreground/background regions to support ROI localization and image classification. We propose PixelCAM, a cost-effective foreground/background pixel-wise classifier in the pixel-feature space that allows for spatial object localization. PixelCAM is trained using pixel pseudo-labels collected from a pretrained WSOL model. Both image and pixel-wise classifiers are trained simultaneously using standard gradient descent. In addition, our pixel classifier can easily be integrated into CNN- and transformer-based architectures without any modifications. The approach also shows improved robustness and applicability on out-of-distribution datasets.
https://arxiv.org/abs/2503.24135
Weakly supervised video grounding aims to localize temporal boundaries relevant to a given query without explicit ground-truth temporal boundaries. While existing methods primarily use Gaussian-based proposals, they overlook the importance of (1) boundary prediction and (2) top-1 prediction selection during inference. In their boundary prediction, boundaries are simply set at half a standard deviation away from a Gaussian mean on both sides, which may not accurately capture the optimal boundaries. In the top-1 prediction process, these existing methods rely heavily on intersections with other proposals, without considering the varying quality of each proposal. To address these issues, we explore various inference strategies by introducing (1) novel boundary prediction methods to capture diverse boundaries from multiple Gaussians and (2) new selection methods that take proposal quality into account. Extensive experiments on the ActivityNet Captions and Charades-STA datasets validate the effectiveness of our inference strategies, demonstrating performance improvements without requiring additional training.
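The snippet below illustrates both inference choices on toy data: boundaries cut at mean ± α·σ from a Gaussian proposal, and a top-1 selection that weights overlap votes by proposal quality; α and the exact quality weighting are assumptions, not the paper's formulas.

```python
# Toy illustration of Gaussian boundary prediction and quality-aware top-1 selection.
import numpy as np

def gaussian_boundaries(mu, sigma, alpha=0.5):
    # conventional rule: cut half a std-dev either side of the mean;
    # varying alpha (or mixing several Gaussians) yields more diverse boundaries
    return mu - alpha * sigma, mu + alpha * sigma

def select_top1(proposals, scores):
    # proposals: list of (start, end); scores: per-proposal quality in [0, 1]
    def overlap(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        return inter / (max(a[1], b[1]) - min(a[0], b[0]) + 1e-8)
    votes = [scores[i] * sum(scores[j] * overlap(p, q)
                             for j, q in enumerate(proposals) if j != i)
             for i, p in enumerate(proposals)]
    return proposals[int(np.argmax(votes))]

props = [(0.20, 0.45), (0.25, 0.50), (0.70, 0.90)]
print(select_top1(props, [0.9, 0.8, 0.3]))  # quality-weighted consensus proposal
```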
https://arxiv.org/abs/2503.23181
Tweets provide valuable semantic context for earth observation tasks and serve as a complementary modality to remote sensing imagery. In building function classification (BFC), tweets are often collected using geographic heuristics and labeled via external databases, an inherently weakly supervised process that introduces both label noise and sentence-level feature noise (e.g., irrelevant or uninformative tweets). While label noise has been widely studied, the impact of sentence-level feature noise remains underexplored, largely due to the lack of clean benchmark datasets for controlled analysis. In this work, we propose a method for generating a synthetic oracle dataset using an LLM, designed to contain only tweets that are both correctly labeled and semantically relevant to their associated buildings. This oracle dataset enables systematic investigation of noise impacts that are otherwise difficult to isolate in real-world data. To assess its utility, we compare model performance using Naive Bayes and mBERT classifiers under three configurations: real vs. synthetic training data, and cross-domain generalization. Results show that noise in real tweets significantly degrades the contextual learning capacity of mBERT, reducing its performance to that of a simple keyword-based model. In contrast, the clean synthetic dataset allows mBERT to learn effectively, outperforming Naive Bayes by a large margin. These findings highlight that addressing feature noise is more critical than model complexity in this task. Our synthetic dataset offers a novel experimental environment for future noise injection studies and is publicly available on GitHub.
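For context, the keyword-level baseline is essentially a bag-of-words Naive Bayes classifier; a toy sketch (with invented tweets and labels, not the paper's data) looks like this:

```python
# Bag-of-words Naive Bayes baseline for building-function labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["great coffee and cake here", "lecture hall was packed today",
          "new gym equipment arrived", "espresso machine broke again"]
labels = ["commercial", "education", "recreation", "commercial"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(tweets, labels)
print(clf.predict(["latte art workshop downtown"]))  # keyword-driven prediction
```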
https://arxiv.org/abs/2503.22856
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: this https URL
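A minimal sketch of a global phrase-level contrastive objective of the InfoNCE form, assuming paired gesture-video and text/speech embeddings; the paper's local gesture-word coupling loss is not reproduced here.

```python
# Symmetric InfoNCE over paired gesture-video and text/speech embeddings.
import torch
import torch.nn.functional as F

def phrase_contrastive_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))           # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = phrase_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```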
https://arxiv.org/abs/2503.22668
Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
https://arxiv.org/abs/2503.21616
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
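One way to picture the classification-score reward (a hedged reading, with a stub classifier standing in for the real one): the reward is the drop in the classifier's "nodule present" confidence after an agent erases a superpixel region.

```python
# Illustrative classification-score reward for the erasing agents.
import numpy as np

def classification_reward(classifier, image, mask, erased_value=0.0):
    before = classifier(image)
    erased = image.copy()
    erased[mask] = erased_value            # agent erases one superpixel region
    after = classifier(erased)
    return before - after                  # large drop => region was truly nodule

toy_classifier = lambda img: float(img[8:16, 8:16].mean())  # stub "confidence"
img = np.zeros((24, 24)); img[8:16, 8:16] = 1.0
mask = np.zeros_like(img, dtype=bool); mask[8:12, 8:16] = True
print(classification_reward(toy_classifier, img, mask))  # 0.5
```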
https://arxiv.org/abs/2503.20685
Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent approaches in image manipulation detection have largely been driven by fully supervised approaches, which require labor-intensive pixel-level annotations. Thus, it is essential to explore weakly supervised image manipulation localization methods that only require image-level binary labels for training. However, existing weakly supervised image manipulation methods overlook the importance of edge information for accurate localization, leading to suboptimal localization performance. To address this, we propose a Context-Aware Boundary Localization (CABL) module to aggregate boundary features and learn context-inconsistency for localizing manipulated areas. Furthermore, by leveraging Class Activation Mapping (CAM) and Segment Anything Model (SAM), we introduce the CAM-Guided SAM Refinement (CGSR) module to generate more accurate manipulation localization maps. By integrating two modules, we present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture. Our method achieves outstanding localization performance across multiple datasets.
https://arxiv.org/abs/2503.20294
With the rapid advancement of pathology foundation models (FMs), the representation learning of whole slide images (WSIs) attracts increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning. However, these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI, introducing patch-text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.
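A hedged sketch of the parameter-free aggregation: each patch is weighted by its similarity to prototype embeddings, and the prototype-conditioned views are pooled into a slide embedding; the temperature and final pooling are assumptions, not the paper's exact scheme.

```python
# Parameter-free attention aggregation against prototype embeddings.
import torch
import torch.nn.functional as F

def prototype_slide_embedding(patches, prototypes, tau=0.1):
    # patches: (n_patches, d); prototypes: (n_proto, d) from LLM-generated text
    sim = F.normalize(patches, dim=-1) @ F.normalize(prototypes, dim=-1).T
    attn = torch.softmax(sim / tau, dim=0)         # patch weights per prototype
    proto_views = attn.T @ patches                 # (n_proto, d), no parameters
    return proto_views.mean(dim=0)                 # pooled slide embedding

slide = prototype_slide_embedding(torch.randn(100, 512), torch.randn(8, 512))
```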
https://arxiv.org/abs/2503.20190
We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the camera pose.
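The Procrustes step itself is standard and compact; a self-contained sketch recovering the planar rotation and translation (the 3-DoF pose) from matched point sets:

```python
# Rigid 2D Procrustes (Kabsch) alignment of matched BEV and aerial points.
import numpy as np

def procrustes_2d(src, dst):
    """src, dst: (N, 2) matched points. Returns 2x2 rotation R and translation t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
src = np.random.rand(20, 2)
dst = src @ R_true.T + np.array([2.0, -1.0])
R, t = procrustes_2d(src, dst)
print(np.rad2deg(np.arctan2(R[1, 0], R[0, 0])), t)  # ≈ 30.0, [2.0, -1.0]
```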
https://arxiv.org/abs/2503.18725
Weak supervision allows machine learning models to learn from limited or noisy labels, but it introduces challenges in interpretability and reliability - particularly in multi-instance partial label learning (MI-PLL), where models must resolve both ambiguous labels and uncertain instance-label mappings. We propose a semantics for a neuro-symbolic framework that integrates Inductive Logic Programming (ILP) to improve MI-PLL by providing structured relational constraints that guide learning. Within our semantic characterization, ILP defines a logical hypothesis space for label transitions, clarifies classifier semantics, and establishes interpretable performance standards. This hybrid approach improves robustness, transparency, and accountability in weakly supervised settings, ensuring neural predictions align with domain knowledge. By embedding weak supervision into a logical framework, we enhance both interpretability and learning, making weak supervision more suitable for real-world, high-stakes applications.
https://arxiv.org/abs/2503.18509
LiDAR (Light Detection and Ranging) enables rapid and accurate acquisition of three-dimensional spatial data, widely applied in remote sensing areas such as surface mapping, environmental monitoring, urban modeling, and forestry inventory. LiDAR remote sensing primarily includes data interpretation and LiDAR-based inversion. However, LiDAR interpretation typically relies on dense and precise annotations, which are costly and time-consuming. Similarly, LiDAR inversion depends on scarce supervisory signals and expensive field surveys for annotations. To address this challenge, weakly supervised learning has gained significant attention in recent years, with many methods emerging to tackle LiDAR remote sensing tasks using incomplete, inaccurate, and inexact annotations, as well as annotations from other domains. Existing review articles treat LiDAR interpretation and inversion as separate tasks. This review, for the first time, adopts a unified weakly supervised learning perspective to systematically examine research on both LiDAR interpretation and inversion. We summarize the latest advancements, provide a comprehensive review of the development and application of weakly supervised techniques in LiDAR remote sensing, and discuss potential future research directions in this field.
https://arxiv.org/abs/2503.18384
Automatic medical image segmentation plays a crucial role in computer-aided diagnosis. However, fully supervised learning approaches often require extensive and labor-intensive annotation efforts. To address this challenge, weakly supervised learning methods, particularly those using extreme points as supervisory signals, have the potential to offer an effective solution. In this paper, we introduce Deep Extreme Point Tracing (DEPT) integrated with a Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation. Notably, our method generates pseudo labels by identifying the lowest-cost path that connects all extreme points on the feature-map-based cost matrix. Additionally, an iterative training strategy is proposed to refine pseudo labels progressively, enabling continuous network improvement. Experimental results on two public datasets demonstrate the effectiveness of our proposed method. The performance of our method approaches that of the fully supervised method and outperforms several existing weakly supervised methods.
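The lowest-cost-path step can be pictured as Dijkstra's algorithm on the cost matrix, chained through the extreme points to trace a closed boundary; the sketch below uses a random cost matrix in place of the feature-derived one.

```python
# Dijkstra over a 2D cost matrix, chained through extreme points to form a contour.
import heapq
import numpy as np

def lowest_cost_path(cost, start, goal):
    h, w = cost.shape
    dist = np.full((h, w), np.inf)
    dist[start] = cost[start]
    prev, pq = {}, [(cost[start], start)]
    while pq:
        d, (y, x) = heapq.heappop(pq)
        if (y, x) == goal:
            break
        if d > dist[y, x]:
            continue  # stale queue entry
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and d + cost[ny, nx] < dist[ny, nx]:
                dist[ny, nx] = d + cost[ny, nx]
                prev[(ny, nx)] = (y, x)
                heapq.heappush(pq, (dist[ny, nx], (ny, nx)))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

rng = np.random.default_rng(0)
cost = rng.random((32, 32))                              # stand-in cost matrix
extremes = [(0, 15), (15, 31), (31, 15), (15, 0)]        # top/right/bottom/left
boundary = [p for a, b in zip(extremes, extremes[1:] + extremes[:1])
            for p in lowest_cost_path(cost, a, b)]       # closed pseudo-label contour
```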
https://arxiv.org/abs/2503.15260