Tuberculosis (TB) remains one of the leading causes of mortality worldwide, particularly in resource-limited countries. Chest X-ray (CXR) imaging serves as an accessible and cost-effective diagnostic tool but requires expert interpretation, which is often unavailable. Although machine learning models have shown high performance in TB classification, they often depend on spurious correlations and fail to generalize. Moreover, building large datasets with high-quality annotations for medical images demands substantial resources and input from domain specialists, and typically involves several annotators reaching agreement, which incurs enormous financial and logistical expense. This study repurposes knowledge distillation to train CNN models that reduce spurious correlations and localize TB-related abnormalities without requiring bounding-box annotations. Leveraging a teacher-student framework with a ResNet50 architecture, the proposed method achieves a 0.2428 mIoU score when trained on the TBX11K dataset. Experimental results further reveal that the student model consistently outperforms the teacher, underscoring improved robustness and potential for broader clinical deployment in diverse settings.
https://arxiv.org/abs/2512.11057
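The abstract does not spell out the distillation objective, so the sketch below shows the classic response-based teacher-student loss (temperature-softened KL divergence plus hard-label cross-entropy, in the style of Hinton et al.); the paper's exact formulation may differ.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Response-based KD: alpha * T^2 * KL(teacher || student) on softened
    distributions, plus (1 - alpha) * cross-entropy with the hard labels."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * (T ** 2) * kl + (1 - alpha) * ce))
```

With a high temperature, the softened teacher distribution carries inter-class similarity information that hard labels discard, which is what lets a student inherit (and sometimes exceed) the teacher's behavior.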
Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: this https URL.
https://arxiv.org/abs/2512.10353
Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on the BCSS-WSSS dataset demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.
https://arxiv.org/abs/2512.10316
Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
https://arxiv.org/abs/2512.10314
Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces ABBSPO, a WS-OOD framework built on adaptive bounding box scaling and symmetry-prior-based orientation prediction. ABBSPO addresses a limitation of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccurate scale estimation. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits the inherent symmetry of aerial objects for self-supervised learning, resolving an issue in previous methods where learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.
https://arxiv.org/abs/2512.10031
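The SPA loss itself is not given in the abstract; the following is a hypothetical sketch of angle-consistency terms across the three augmented views, using the fact that an oriented box is unchanged by a 180-degree rotation (function names and the exact form are assumptions, not the paper's definition).

```python
import numpy as np

def ang_diff(a, b, period=np.pi):
    """Smallest signed difference between two box angles; the period is pi
    because an oriented box is symmetric under a 180-degree rotation."""
    return (a - b + period / 2) % period - period / 2

def symmetry_consistency_loss(theta_orig, theta_rot, theta_flip, delta):
    """Self-supervised consistency over the three augmented views: the view
    rotated by delta should shift the predicted angle by delta, and the
    flipped view should negate it. A hypothetical stand-in for SPA."""
    l_rot = ang_diff(theta_rot, theta_orig + delta) ** 2
    l_flip = ang_diff(theta_flip, -theta_orig) ** 2
    return float(np.mean(l_rot + l_flip))
```

Because both terms are wrapped differences, the loss is zero exactly when the three views agree up to the known transformation, which is the self-supervision signal the symmetry prior provides.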
Accurate segmentation of vertebral metastasis in CT is clinically important yet difficult to scale, as voxel-level annotations are scarce and both lytic and blastic lesions often resemble benign degenerative changes. We introduce a weakly supervised method trained solely on vertebra-level healthy/malignant labels, without any lesion masks. The method combines a Diffusion Autoencoder (DAE) that produces a classifier-guided healthy edit of each vertebra with pixel-wise difference maps that propose candidate lesion regions. To determine which regions truly reflect malignancy, we introduce Hide-and-Seek Attribution: each candidate is revealed in turn while all others are hidden, the edited image is projected back to the data manifold by the DAE, and a latent-space classifier quantifies the isolated malignant contribution of that component. High-scoring regions form the final lytic or blastic segmentation. On held-out radiologist annotations, we achieve strong blastic/lytic performance despite no mask supervision (F1: 0.91/0.85; Dice: 0.87/0.78), exceeding baselines (F1: 0.79/0.67; Dice: 0.74/0.55). These results show that vertebra-level labels can be transformed into reliable lesion masks, demonstrating that generative editing combined with selective occlusion supports accurate weakly supervised segmentation in CT.
https://arxiv.org/abs/2512.06849
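Hide-and-Seek Attribution can be sketched as a loop over candidate regions: reveal one candidate while hiding the rest, score its isolated contribution, keep the high scorers. Here `malignancy_score` is a stand-in for the paper's DAE manifold projection followed by the latent-space classifier, which is the part this sketch omits.

```python
import numpy as np

def hide_and_seek(image, healthy_edit, candidate_masks, malignancy_score, thresh=0.5):
    """For each candidate region, paste the original (possibly malignant)
    pixels onto the classifier-guided healthy edit while all other
    candidates stay hidden, then score that single revealed component.
    Regions scoring above `thresh` form the final lesion mask."""
    kept = np.zeros_like(image, dtype=bool)
    for mask in candidate_masks:
        probe = healthy_edit.copy()
        probe[mask] = image[mask]          # reveal only this candidate
        if malignancy_score(probe) > thresh:
            kept |= mask
    return kept
```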
Deploying video anomaly detection in practice is hampered by the scarcity and collection cost of real abnormal footage. We address this by training without any real abnormal videos while evaluating under the standard weakly supervised split, and we introduce PA-VAD, a generation-driven approach that learns a detector from synthesized pseudo-abnormal videos paired with real normal videos, using only a small set of real normal images to drive synthesis. For synthesis, we select class-relevant initial images with CLIP and refine textual prompts with a vision-language model to improve fidelity and scene consistency before invoking a video diffusion model. For training, we mitigate excessive spatiotemporal magnitude in synthesized anomalies with a domain-aligned regularization module that combines domain alignment and memory-usage-aware updates. Extensive experiments show that our approach reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest real-abnormal method on ShanghaiTech by +0.6% and outperforming the UVAD state-of-the-art on UCF-Crime by +1.9%. The results demonstrate that high-accuracy anomaly detection can be obtained without collecting real anomalies, providing a practical path toward scalable deployment.
https://arxiv.org/abs/2512.06845
Shearography is an interferometric technique sensitive to surface displacement gradients, providing high sensitivity for detecting subsurface defects in safety-critical components. A key limitation to industrial adoption is the lack of high-quality annotated datasets, since manual labeling remains labor-intensive, subjective, and difficult to standardize. We introduce an automated workflow that generates defect annotations from shearography measurements using deep learning, producing high-resolution segmentation and bounding-box labels. Evaluation against expert-labeled data demonstrates sufficient accuracy to enable weakly supervised training, reducing manual effort and supporting scalable dataset creation for robust defect detection.
https://arxiv.org/abs/2512.06171
Weakly supervised semantic segmentation (WSSS) in histopathology reduces pixel-level labeling by learning from image-level labels, but it is hindered by inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage (global pooling-based class activation maps whose activations highlight only the most distinctive areas and miss nearby class regions). Recent works address these challenges by constructing a clustering prototype bank and then refining masks in a separate stage; however, such two-stage pipelines are costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning, limiting their effectiveness and efficiency. We propose a cluster-free, one-stage learnable-prototype framework with diversity regularization to enhance morphological intra-class heterogeneity coverage. Our approach achieves state-of-the-art (SOTA) performance on BCSS-WSSS, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels, and activation heatmaps further reveal that, compared with clustering-based prototypes, our learnable prototypes cover more diverse and complementary regions within each class, providing consistent qualitative evidence for their effectiveness.
https://arxiv.org/abs/2512.05922
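One plausible form of the diversity regularization is a penalty on the mean pairwise cosine similarity of the learnable prototypes, so that prototypes of the same class are pushed toward complementary regions. This is an assumed instantiation, not the paper's exact term.

```python
import numpy as np

def diversity_penalty(prototypes):
    """Mean pairwise cosine similarity between prototype vectors (rows).
    Adding this to the training loss discourages redundant prototypes:
    identical prototypes score 1, mutually orthogonal ones score 0."""
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = P @ P.T
    k = len(P)
    off_diag = sim[~np.eye(k, dtype=bool)]
    return float(off_diag.mean())
```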
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: this https URL
https://arxiv.org/abs/2512.05113
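The temporal anchoring can be sketched as two simple L2 loss terms over per-Gaussian parameters, matching the abstract's claim that the method integrates via loss terms alone. The detection of hidden and defective states is taken as given (boolean masks), and the flat parameter layout is a hypothetical simplification.

```python
import numpy as np

def temporal_anchoring_loss(params_t, past_params, future_params, hidden, defective):
    """Anchor hidden Gaussians to their recent well-observed past state and
    defective Gaussians to a better-supervised future state. `params_*`
    are (num_gaussians, dim) arrays of per-Gaussian attributes; `hidden`
    and `defective` are boolean masks over Gaussians."""
    loss = 0.0
    if hidden.any():
        loss += float(np.mean((params_t[hidden] - past_params[hidden]) ** 2))
    if defective.any():
        loss += float(np.mean((params_t[defective] - future_params[defective]) ** 2))
    return loss
```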
Multi-view crowd counting can effectively mitigate occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project different camera view images onto a common space to obtain ground-plane density maps, requiring abundant and costly crowd annotations and camera calibrations. Hence, calibration-free methods are proposed that do not require camera calibrations and scene-level crowd annotations. However, existing calibration-free methods still require expensive image-level crowd annotations for training the single-view counting module. Thus, in this paper, we propose a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC), directly using crowd count as supervision for the single-view counting module rather than density maps constructed from crowd annotations. Instead, a self-supervised ranking loss that leverages multi-scale priors is utilized to enhance the model's perceptual ability without additional annotation costs. Moreover, the proposed model leverages semantic information to achieve a more accurate view matching and, consequently, a more precise scene-level crowd count estimation. The proposed method outperforms the state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, indicating that it is more suitable for practical deployment compared with calibrated methods. Code is released at this https URL.
https://arxiv.org/abs/2512.02359
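A common instantiation of a self-supervised ranking loss with multi-scale priors: a crop contained in a larger view cannot hold more people, so predicted counts over a nested sequence of views should be non-increasing. The sketch below assumes that prior; the paper's exact loss may differ.

```python
import numpy as np

def multiscale_ranking_loss(counts, margin=0.0):
    """`counts[i]` is the predicted count for the i-th view in a nested
    sequence (full image first, then successively smaller contained
    crops). Each contained crop should predict no more people than its
    parent; violations are penalized with a hinge."""
    counts = np.asarray(counts, dtype=float)
    hinges = np.maximum(0.0, counts[1:] - counts[:-1] + margin)
    return float(hinges.sum())
```

The appeal of this loss is that it needs no annotations at all: the ranking constraint comes purely from the geometry of the crops.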
In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
https://arxiv.org/abs/2512.01701
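A minimal version of using superpixel spatial priors to clean up activations is to average the CAM within each superpixel, snapping activations to low-level region boundaries. SGC's actual filtering during affinity propagation is more involved; this only illustrates the underlying prior.

```python
import numpy as np

def superpixel_smooth(cam, superpixels):
    """Replace each activation with the mean activation of its superpixel.
    `cam` and `superpixels` are (H, W) arrays; `superpixels` holds an
    integer region id per pixel. Activations leaking across a region
    boundary get averaged away inside their region."""
    out = np.empty_like(cam, dtype=float)
    for sp in np.unique(superpixels):
        region = superpixels == sp
        out[region] = cam[region].mean()
    return out
```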
Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR) with a complementary scalable paradigm leveraging sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvement generalizes well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision. The project page for FishDetector-R1 is this https URL.
https://arxiv.org/abs/2512.05996
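The GAME metric cited above is worth unpacking, since it is what distinguishes localized counting from a merely correct global count: the map is split into a grid and absolute count errors are summed per cell. A sketch of one level (the standard definition; the paper's evaluation protocol details are assumed):

```python
import numpy as np

def game(pred_density, gt_density, level):
    """Grid Average Mean absolute Error at one level: split the density map
    into a 2^level x 2^level grid and sum per-cell absolute count errors,
    so a person counted in the wrong cell is penalized even when the
    global count matches. Level 0 reduces to the plain count error."""
    n = 2 ** level
    h, w = pred_density.shape
    err = 0.0
    for i in range(n):
        for j in range(n):
            rs, re = i * h // n, (i + 1) * h // n
            cs, ce = j * w // n, (j + 1) * w // n
            err += abs(pred_density[rs:re, cs:ce].sum() - gt_density[rs:re, cs:ce].sum())
    return float(err)
```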
Monocular 3D object detection is a fundamental yet challenging task in 3D scene understanding. Existing approaches heavily depend on supervised learning with extensive 3D annotations, which are often acquired from LiDAR point clouds through labor-intensive labeling processes. To tackle this problem, we propose VSRD++, a novel weakly supervised framework for monocular 3D object detection that eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering with weak 2D supervision. VSRD++ consists of a two-stage pipeline: multi-view 3D autolabeling and subsequent monocular 3D detector training. In the multi-view autolabeling stage, object surfaces are represented as signed distance fields (SDFs) and rendered as instance masks via the proposed instance-aware volumetric silhouette rendering. To optimize 3D bounding boxes, we decompose each instance's SDF into a cuboid SDF and a residual distance field (RDF) that captures deviations from the cuboid. To address the geometry inconsistency commonly observed in volume rendering methods applied to dynamic objects, we model the dynamic objects by including velocity into bounding box attributes as well as assigning confidence to each pseudo-label. Moreover, we also employ a 3D attribute initialization module to initialize the dynamic bounding box parameters. In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels for training monocular 3D object detectors. Extensive experiments on the KITTI-360 dataset demonstrate that VSRD++ significantly outperforms existing weakly supervised approaches for monocular 3D object detection on both static and dynamic scenes. Code is available at this https URL
https://arxiv.org/abs/2512.01178
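The cuboid part of the SDF decomposition has a closed form for a box in its canonical (axis-aligned) frame; the learned residual distance field that captures deviations from the cuboid is not shown. This is a generic box SDF, assumed to match the paper's cuboid term only in spirit.

```python
import numpy as np

def cuboid_sdf(points, center, half_extents):
    """Signed distance from `points` (..., 3) to an axis-aligned cuboid:
    negative inside, zero on the surface, positive outside."""
    q = np.abs(points - center) - half_extents
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(q.max(axis=-1), 0.0)
    return outside + inside
```

Decomposing an instance SDF as `cuboid_sdf + residual` ties the optimizable 3D box parameters (center, extents) directly to the rendered silhouette, which is what lets 2D mask supervision move the box.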
Micro-facial expressions are brief and involuntary facial movements that reflect genuine emotional states. While most prior work focuses on classifying discrete micro-expression categories, far fewer studies address the continuous evolution of intensity over time. Progress in this direction is limited by the lack of frame-level intensity labels, which makes fully supervised regression impractical. We propose a unified framework for continuous micro-expression intensity estimation using only weak temporal labels (onset, apex, offset). A simple triangular prior converts sparse temporal landmarks into dense pseudo-intensity trajectories, and a lightweight temporal regression model that combines a ResNet18 encoder with a bidirectional GRU predicts frame-wise intensity directly from image sequences. The method requires no frame-level annotation effort and is applied consistently across datasets through a single preprocessing and temporal alignment pipeline. Experiments on SAMM and CASME II show strong temporal agreement with the pseudo-intensity trajectories. On SAMM, the model reaches a Spearman correlation of 0.9014 and a Kendall correlation of 0.7999, outperforming a frame-wise baseline. On CASME II, it achieves up to 0.9116 and 0.8168, respectively, when trained without the apex-ranking term. Ablation studies confirm that temporal modeling and structured pseudo labels are central to capturing the rise-apex-fall dynamics of micro-facial movements. To our knowledge, this is the first unified approach for continuous micro-expression intensity estimation using only sparse temporal annotations.
https://arxiv.org/abs/2512.01145
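The triangular prior is concrete enough to state directly: a linear rise from onset to apex and a linear fall from apex to offset, zero elsewhere. A minimal sketch (frame indices as integers; edge handling is an assumption):

```python
import numpy as np

def triangular_prior(n_frames, onset, apex, offset):
    """Convert sparse temporal landmarks (onset, apex, offset) into a dense
    pseudo-intensity trajectory in [0, 1] for frame-wise regression."""
    t = np.arange(n_frames, dtype=float)
    intensity = np.zeros(n_frames)
    rise = (t >= onset) & (t <= apex)
    fall = (t > apex) & (t <= offset)
    intensity[rise] = (t[rise] - onset) / max(apex - onset, 1)
    intensity[fall] = (offset - t[fall]) / max(offset - apex, 1)
    return intensity
```

These trajectories are what stand in for the missing frame-level labels: the regressor is trained against them, so only the three weak temporal annotations per clip are ever needed.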
Cutmix-based data augmentation, which uses a cut-and-paste strategy, has shown remarkable generalization capabilities in deep learning. However, existing methods primarily consider global semantics with image-level constraints, which excessively reduces attention to the discriminative local context of the class and leads to a performance improvement bottleneck. Moreover, existing methods for generating augmented samples usually involve cutting and pasting rectangular or square regions, resulting in a loss of object part information. To mitigate the problem of inconsistency between the augmented image and the generated mixed label, existing methods usually require double forward propagation or rely on an external pre-trained network for object centering, which is inefficient. To overcome the above limitations, we propose LGCOAMix, an efficient context-aware and object-part-aware superpixel-based grid blending method for data augmentation. To the best of our knowledge, this is the first time that a label mixing strategy using a superpixel attention approach has been proposed for cutmix-based data augmentation. It is the first instance of learning local features from discriminative superpixel-wise regions and cross-image superpixel contrasts. Extensive experiments on various benchmark datasets show that LGCOAMix outperforms state-of-the-art cutmix-based data augmentation methods on classification tasks, and on weakly supervised object localization on CUB200-2011. We have demonstrated the effectiveness of LGCOAMix not only for CNN networks, but also for Transformer networks. Source codes are available at this https URL.
https://arxiv.org/abs/2512.00130
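The general recipe behind superpixel-based cut-and-paste augmentation is to mix the labels by the actual pasted area rather than a rectangle's area, so the soft label matches the pixel evidence even for irregular regions. The sketch below shows that recipe (grayscale (H, W) images for simplicity), not LGCOAMix's attention-based label mixing.

```python
import numpy as np

def superpixel_mix(img_a, img_b, superpixels_b, chosen_ids, label_a, label_b):
    """Paste the chosen superpixels of image B onto image A, then mix the
    one-hot labels by the fraction of pixels each source contributes."""
    mask = np.isin(superpixels_b, chosen_ids)   # True where B is pasted
    mixed = np.where(mask, img_b, img_a)
    lam = 1.0 - mask.mean()                     # fraction still from A
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label
```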
Weakly supervised learning has emerged as a practical alternative to fully supervised learning when complete and accurate labels are costly or infeasible to acquire. However, many existing methods are tailored to specific supervision patterns -- such as positive-unlabeled (PU), unlabeled-unlabeled (UU), complementary-label (CLL), partial-label (PLL), or similarity-unlabeled annotations -- and rely on post-hoc corrections to mitigate instability induced by indirect supervision. We propose a principled, unified framework that bypasses such post-hoc adjustments by directly formulating a stable surrogate risk grounded in the structure of weakly supervised data. The formulation naturally subsumes diverse settings -- including PU, UU, CLL, PLL, multi-class unlabeled, and tuple-based learning -- under a single optimization objective. We further establish a non-asymptotic generalization bound via Rademacher complexity that clarifies how supervision structure, model capacity, and sample size jointly govern performance. Beyond this, we analyze the effect of class-prior misspecification on the bound, deriving explicit terms that quantify its impact, and we study identifiability, giving sufficient conditions -- most notably via supervision stratification across groups -- under which the target risk is recoverable. Extensive experiments show consistent gains across class priors, dataset scales, and class counts -- without heuristic stabilization -- while exhibiting robustness to overfitting.
https://arxiv.org/abs/2511.22823
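For context on the "post-hoc corrections" the abstract says it bypasses: the standard non-negative PU risk estimator (Kiryo et al.) recovers the negative-class risk from unlabeled data by subtracting the positives' contribution, then clips at zero to keep training stable. A numpy sketch of that baseline estimator, with per-sample losses precomputed:

```python
import numpy as np

def nn_pu_risk(loss_pos_as_pos, loss_pos_as_neg, loss_unl_as_neg, prior):
    """Non-negative PU risk: prior-weighted positive risk plus the
    negative risk estimated from unlabeled data, where the latter is
    E_u[l(f, -1)] - prior * E_p[l(f, -1)] clipped at zero. The clip is
    exactly the post-hoc stabilization the proposed framework avoids."""
    pos_risk = prior * loss_pos_as_pos.mean()
    neg_risk = loss_unl_as_neg.mean() - prior * loss_pos_as_neg.mean()
    return float(pos_risk + max(neg_risk, 0.0))
```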
Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world settings. In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
https://arxiv.org/abs/2511.20359
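Distilling a frozen SAM-based teacher into a lightweight student under coarse box annotations can be sketched as a two-term pixel loss: soft teacher targets everywhere, plus hard negatives outside the annotated box. A minimal NumPy illustration; `distill_loss`, the weighting `alpha`, and the "outside-the-box pixels are negatives" rule are assumptions, not the paper's exact objective.

```python
import numpy as np

def bce_logits(logits, targets):
    # Numerically stable binary cross-entropy on raw logits.
    return np.maximum(logits, 0) - logits * targets \
        + np.log1p(np.exp(-np.abs(logits)))

def distill_loss(student_logits, teacher_probs, box_mask, alpha=0.5):
    # Soft distillation: match the frozen teacher's per-pixel probabilities.
    l_soft = bce_logits(student_logits, teacher_probs).mean()
    # Coarse supervision: pixels outside the annotated box are negatives.
    outside = 1.0 - box_mask
    l_box = (bce_logits(student_logits, np.zeros_like(student_logits))
             * outside).sum() / max(outside.sum(), 1.0)
    return alpha * l_soft + (1 - alpha) * l_box

rng = np.random.default_rng(1)
H, W = 8, 8
logits = rng.normal(size=(H, W))                        # student output
teacher = 1.0 / (1.0 + np.exp(-rng.normal(size=(H, W))))  # teacher probs
box = np.zeros((H, W)); box[2:6, 2:6] = 1.0             # coarse box mask
loss = distill_loss(logits, teacher, box)
```

The box term supplies the "relatively accurate at lower cost" signal, while the soft term transfers the teacher's fine-grained boundary knowledge.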
Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective -- without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
https://arxiv.org/abs/2511.19765
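Using predicted per-pixel aleatoric uncertainty to weight losses, as the refiner above does, resembles the well-known heteroscedastic formulation of Kendall and Gal: attenuate the loss where the model predicts high noise, at the cost of a log-variance penalty. A toy sketch under that assumption; the function name and the 0.5 regularization weight are illustrative, not CrispFormer's exact formulation.

```python
import numpy as np

def uncertainty_weighted_loss(per_pixel_ce, log_var):
    # Aleatoric weighting: exp(-log_var) attenuates the loss at pixels
    # predicted to be noisy, while the 0.5 * log_var penalty prevents the
    # model from declaring everything uncertain to zero out its loss.
    return (np.exp(-log_var) * per_pixel_ce + 0.5 * log_var).mean()

rng = np.random.default_rng(2)
ce = rng.uniform(0.0, 2.0, size=(16, 16))   # per-pixel cross-entropy values
# With zero predicted uncertainty the weighting is a no-op:
flat = uncertainty_weighted_loss(ce, np.zeros_like(ce))
```

In a WSSS pipeline this lets noisy seed labels contribute less gradient than confident regions, which is the stated goal of the uncertainty-guided refiner.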
Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach to generate local maps from on-board sensors. However, existing methods typically rely on costly 3D map annotations for training, which limits their generalization and scalability across diverse driving environments. In this work, we propose MapRF, a weakly supervised framework that learns to construct 3D maps using only 2D image labels. To generate high-quality pseudo labels, we introduce a novel Neural Radiance Fields (NeRF) module conditioned on map predictions, which reconstructs view-consistent 3D geometry and semantics. These pseudo labels are then iteratively used to refine the map network in a self-training manner, enabling progressive improvement without additional supervision. Furthermore, to mitigate error accumulation during self-training, we propose a Map-to-Ray Matching strategy that aligns map predictions with camera rays derived from 2D labels. Extensive experiments on the Argoverse 2 and nuScenes datasets demonstrate that MapRF achieves performance comparable to fully supervised methods, attaining around 75% of the baseline while surpassing several approaches using only 2D labels. This highlights the potential of MapRF to enable scalable and cost-effective online HD map construction for autonomous driving.
https://arxiv.org/abs/2511.19527
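The Map-to-Ray Matching strategy aligns 3D map predictions with 2D labels through the camera geometry. A simplified stand-in below projects predicted map points with a pinhole model and scores them with a one-sided chamfer distance to labeled pixels; the function names and the chamfer choice are assumptions, not MapRF's exact matching objective.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    # Pinhole projection: world -> camera -> perspective divide -> pixels.
    cam = points_3d @ R.T + t
    uv = cam[:, :2] / cam[:, 2:3]
    return uv @ K[:2, :2].T + K[:2, 2]

def map_to_label_chamfer(pred_pix, label_pix):
    # One-sided chamfer: each projected map point to its nearest labeled
    # 2D pixel; a simplified proxy for matching predictions to camera rays.
    d = np.linalg.norm(pred_pix[:, None, :] - label_pix[None, :, :], axis=-1)
    return d.min(axis=1).mean()

K = np.array([[100.0,   0.0, 50.0],
              [  0.0, 100.0, 50.0],
              [  0.0,   0.0,  1.0]])
pts = np.array([[0.0, 0.0, 1.0],    # projects to the principal point
                [0.1, 0.0, 2.0]])
pix = project_points(pts, K, np.eye(3), np.zeros(3))
labels = np.array([[50.0, 50.0], [55.0, 50.0]])
loss = map_to_label_chamfer(pix, labels)
```

Minimizing such a reprojection-style distance pulls the 3D map toward consistency with every camera's 2D labels, which is what lets self-training proceed without accumulating 3D errors.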