Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present $\textbf{SegWithU}$, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at this https URL.
https://arxiv.org/abs/2604.15271
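The failure-ranking metrics reported above, AUROC and AURC, can be computed directly from a voxel-wise uncertainty map and a binary map of segmentation errors. A minimal numpy sketch with illustrative function names (the rank-based AUROC below ignores ties, and the paper's exact evaluation protocol may differ):

```python
import numpy as np

def auroc(uncertainty, is_error):
    # Mann-Whitney U formulation: the probability that an erroneous
    # voxel receives a higher uncertainty score than a correct one.
    u = np.asarray(uncertainty, float).ravel()
    e = np.asarray(is_error, bool).ravel()
    ranks = u.argsort().argsort() + 1.0  # 1-based ranks (no tie handling)
    n_pos, n_neg = e.sum(), (~e).sum()
    return (ranks[e].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def aurc(uncertainty, is_error):
    # Risk-coverage curve: sort voxels by increasing uncertainty and
    # accumulate the error rate (risk) as coverage grows.
    order = np.argsort(np.asarray(uncertainty, float).ravel())
    errs = np.asarray(is_error, float).ravel()[order]
    risk = np.cumsum(errs) / np.arange(1, errs.size + 1)
    return risk.mean()
```

Lower AURC is better: a well-ranked uncertainty map keeps risk low until coverage reaches the erroneous voxels.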
Detecting slow-moving landslides directly from wrapped Interferometric Synthetic Aperture Radar (InSAR) interferograms is crucial for efficient geohazard monitoring, yet it remains fundamentally challenged by severe phase ambiguity and complex coherence noise. While the Segment Anything Model (SAM) offers a powerful foundation for segmentation, its direct transfer to wrapped phase data is hindered by a profound spectral domain shift, which suppresses the high-frequency fringes essential for boundary delineation. To bridge this gap, we propose WILD-SAM, a novel parameter-efficient fine-tuning framework specifically designed to adapt SAM for high-precision landslide detection on wrapped interferograms. Specifically, the architecture integrates a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter into the frozen encoder to align spectral distributions and introduces a Wavelet-Guided Subband Enhancement (WGSE) strategy to generate frequency-aware dense prompts. The PA-MoE Adapter exploits a dynamic routing mechanism across heterogeneous convolutional experts to adaptively aggregate multi-scale spectral-textural priors, effectively aligning the distribution discrepancy between natural images and interferometric phase data. Meanwhile, the WGSE strategy leverages discrete wavelet transforms to explicitly disentangle high-frequency subbands and refine directional phase textures, injecting these structural cues as dense prompts to ensure topological integrity along sharp landslide boundaries. Extensive experiments on the ISSLIDE and ISSLIDE+ benchmarks demonstrate that WILD-SAM achieves state-of-the-art performance, significantly outperforming existing methods in both target completeness and contour fidelity.
https://arxiv.org/abs/2604.14540
Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
https://arxiv.org/abs/2604.14074
Noncontact exfiltration of electronic screen content poses a security challenge, with side-channel incursions as the principal vector. We introduce an optical projection side-channel paradigm that confronts two core instabilities: (i) the near-singular Jacobian spectrum of projection mapping breaches Hadamard stability, rendering inversion hypersensitive to perturbations; (ii) irreversible compression in light transport obliterates global semantic cues, magnifying reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR4Net) fuses a Physically Regularized Irradiance Approximation (PRIrr-Approximation), which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that arrests noise propagation. Moreover, an Irreversibility Constrained Semantic Reprojection (ICSR) module reinstates lost global structure through context-driven semantic mapping. Evaluated across four scene categories, IR4Net achieves fidelity beyond competing neural approaches while retaining resilience to illumination perturbations.
https://arxiv.org/abs/2604.13419
Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.
https://arxiv.org/abs/2604.13397
Humans readily recognize objects from sparse line drawings, a capacity that appears early in development and persists across cultures, suggesting neural rather than purely learned origins. Yet the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols remains poorly understood. Here we propose that ancient pictographic writing emerged from the brain's intrinsic tendency to compress visual input into stable, boundary-based abstractions. We construct a biologically inspired digital twin of the visual hierarchy that encodes an image into low-level features, generates a contour sketch, and iteratively refines it through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex. The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems, including Egyptian hieroglyphs, Chinese oracle bone characters, and proto-cuneiform, and offer candidate interpretations for undeciphered scripts. Our findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.
https://arxiv.org/abs/2604.12865
How to describe the shape of a melodic phrase? Scholars have often relied on typologies with a small set of contour types. We question their adequacy: we find no evidence that phrase contours cluster into discrete types, neither in German nor Chinese folksongs, nor in Gregorian chant. The test for clustering we propose applies the dist-dip test of multimodality after a UMAP dimensionality reduction. The test correctly identifies clustering in a synthetic dataset, but not in actual phrase contours. These results raise problems for discrete typologies. In particular, type frequencies may be unreliable, as we see with Huron's typology. We also show how a recent finding of four contour shapes may be an artefact of the analysis. Our findings suggest that melodic contour is best seen as a continuous phenomenon.
https://arxiv.org/abs/2604.13119
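The proposed test pairs a UMAP reduction with the dist-dip test, i.e., Hartigan's dip statistic applied to pairwise distances. Both UMAP and the dip statistic require dedicated libraries, so as a self-contained toy illustration of the same idea, the sketch below computes pairwise distances and then Sarle's bimodality coefficient as a crude stand-in for the dip statistic (values above roughly 0.555, the uniform-distribution value, hint at more than one mode). This is an assumed substitute for illustration, not the paper's procedure:

```python
import numpy as np

def pairwise_distances(X):
    # Condensed Euclidean distance vector over all point pairs.
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)
    return d[iu]

def bimodality_coefficient(x):
    # Sarle's bimodality coefficient; a multimodal distance
    # distribution pushes the value toward 1.
    x = np.asarray(x, float)
    n = x.size
    m = x - x.mean()
    s = x.std()
    skew = (m ** 3).mean() / s ** 3
    kurt = (m ** 4).mean() / s ** 4 - 3.0  # excess kurtosis
    return (skew ** 2 + 1) / (kurt + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
```

Two well-separated clusters produce a bimodal distance distribution and a high coefficient; a single Gaussian cloud does not.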
While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce \textbf{BareBones}, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the \textit{Texture Bias Cliff}. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding.
https://arxiv.org/abs/2604.10528
Remote sensing infrared image super-resolution aims to recover sharper thermal observations from low-resolution inputs while preserving target contours, scene layout, and radiometric stability. Unlike visible-image super-resolution, thermal imagery is weakly textured and more sensitive to unstable local sharpening, which makes complementary local and global modeling especially important. This paper presents our solution to the NTIRE 2026 Infrared Image Super-Resolution Challenge, a dual-branch system that combines a HAT-L branch and a MambaIRv2-L branch. The inference pipeline applies test-time local conversion on HAT, eight-way self-ensemble on MambaIRv2, and fixed equal-weight image-space fusion. We report both the official challenge score and a reproducible evaluation on 12 synthetic ×4 thermal samples derived from Caltech Aerial RGB-Thermal, on which the fused output outperforms either single branch in PSNR, SSIM, and the overall Score. The results suggest that infrared super-resolution benefits from explicit complementarity between locally strong transformer restoration and globally stable state-space modeling.
https://arxiv.org/abs/2604.10112
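The eight-way self-ensemble mentioned in the pipeline is standard test-time augmentation over the eight symmetries of the square (four rotations, each optionally flipped), and the fusion step is a fixed equal-weight average in image space. A minimal numpy sketch under those assumptions, with any callable standing in for the restoration models:

```python
import numpy as np

def dihedral_transforms(img):
    # The eight symmetries of the square: 4 rotations x optional flip.
    out = []
    for k in range(4):
        r = np.rot90(img, k)
        out.append(r)
        out.append(np.fliplr(r))
    return out

def invert_dihedral(img, idx):
    # Undo transform number idx as produced by dihedral_transforms.
    k, flipped = idx // 2, idx % 2
    if flipped:
        img = np.fliplr(img)
    return np.rot90(img, -k)

def self_ensemble(model, img):
    # Average the model outputs over all eight geometric variants.
    preds = [invert_dihedral(model(t), i)
             for i, t in enumerate(dihedral_transforms(img))]
    return np.mean(preds, axis=0)

def fuse(pred_a, pred_b):
    # Fixed equal-weight image-space fusion of two branch outputs.
    return 0.5 * (pred_a + pred_b)
```

With an identity "model", the ensemble must reproduce the input exactly, which is a handy sanity check that each inverse transform matches its forward transform.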
Annotation-free skin lesion segmentation is attractive for low-resource dermoscopic deployment. However, its performance remains constrained by three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most existing annotation-free methods primarily focus on pseudo-label denoising. In contrast, the effect of compressed boundary probabilities on final mask quality has received less explicit attention, although it directly affects contour completeness and cannot be adequately corrected by global threshold adjustment alone. To address this issue, we propose BPC-Net, a boundary probability calibration framework for annotation-free skin lesion segmentation. The core of the framework is Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident lesion boundaries without inducing indiscriminate foreground expansion. To support this calibration under noisy pseudo-supervision and cross-domain transfer, we further incorporate two auxiliary designs: a feature-decoupled decoder that separately handles context suppression, detail recovery, and boundary refinement, and an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the deployed image-only segmentation path. Under a strictly annotation-free protocol, no manual masks are used during training or target-domain adaptation, and validation labels, when available, are used only for final operating-point selection. Experiments on ISIC-2017, ISIC-2018, and PH2 show that the proposed framework achieves state-of-the-art performance among published unsupervised methods, reaching a macro-average Dice coefficient and Jaccard index of 85.80% and 76.97%, respectively, while approaching supervised reference performance on PH2.
https://arxiv.org/abs/2604.05594
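The core GPS idea, as described, is to smooth the predicted probability map locally before thresholding so that under-confident boundary pixels adjacent to confident foreground are recovered. A toy numpy sketch with an illustrative separable Gaussian blur; the kernel width and threshold are assumptions, not the paper's settings:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    # Normalized 1D Gaussian kernel.
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def smooth2d(prob, sigma=1.5):
    # Separable Gaussian blur of a 2D probability map (reflect padding).
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    p = np.pad(prob, r, mode="reflect")
    p = np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 0, p)
    p = np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 1, p)
    return p

def calibrated_mask(prob, sigma=1.5, thresh=0.5):
    # Smooth in probability space, then threshold: under-confident
    # pixels surrounded by confident foreground are recovered.
    return smooth2d(prob, sigma) >= thresh
```

In this construction a thin under-confident strip (probability 0.45) between two confident lesion regions is lost by naive thresholding but recovered after smoothing, while distant background stays suppressed.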
Precise segmentation of objects with highly similar shapes remains a challenging problem in dense prediction, especially in scenarios with ambiguous boundaries, overlapping instances, and weak inter-instance visual differences. While conventional segmentation models are effective at localizing object regions, they often lack the discriminative capacity required to reliably distinguish a target object from morphologically similar distractors. In this work, we study fine-grained object segmentation from an identity-aware perspective and propose Identity-Aware U-Net (IAU-Net), a unified framework that jointly models spatial localization and instance discrimination. Built upon a U-Net-style encoder-decoder architecture, our method augments the segmentation backbone with an auxiliary embedding branch that learns discriminative identity representations from high-level features, while the main branch predicts pixel-accurate masks. To enhance robustness in distinguishing objects with near-identical contours or textures, we further incorporate triplet-based metric learning, which pulls target-consistent embeddings together and separates them from hard negatives with similar morphology. This design enables the model to move beyond category-level segmentation and acquire a stronger capability for precise discrimination among visually similar objects. Experiments on benchmarks including cell segmentation demonstrate promising results, particularly in challenging cases involving similar contours, dense layouts, and ambiguous boundaries.
https://arxiv.org/abs/2604.09702
Many modern video-based human action recognition (HAR) approaches use 2D skeletons as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, the contour of the human body, and the interaction between the human and objects. To address this, we propose an effective approach that augments the skeleton with a representation capturing action-related information in the HAR pipeline. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.
https://arxiv.org/abs/2604.03590
Foundation models (FMs) are reshaping computer vision by reducing reliance on task-specific supervised learning and leveraging general visual representations learned at scale. In precision livestock farming, most pipelines remain dominated by supervised learning models that require extensive labeled data, repeated retraining, and farm-specific tuning. This study presents an FM-centered workflow for automated monitoring of group-housed nursery pigs, in which pretrained vision-language FMs serve as general visual backbones and farm-specific adaptation is achieved through modular post-processing. Grounding-DINO was first applied to 1,418 annotated images to establish a baseline detection performance. While detection accuracy was high under daytime conditions, performance degraded under night-vision and heavy occlusion, motivating the integration of temporal tracking logic. Building on these detections, short-term video segmentation with Grounded-SAM2 was evaluated on 550 one-minute video clips; after post-processing, over 80% of 4,927 active tracks were fully correct, with most remaining errors arising from inaccurate masks or duplicated labels. To support identity consistency over an extended time, we further developed a long-term tracking pipeline integrating initialization, tracking, matching, mask refinement, re-identification, and post-hoc quality control. This system was evaluated on a continuous 132-minute video and maintained stable identities throughout. On 132 uniformly sampled ground-truth frames, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, J&F of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. Overall, this work demonstrates how FM prior knowledge can be combined with lightweight, task-specific logic to enable scalable, label-efficient, and long-duration monitoring in pig production.
https://arxiv.org/abs/2604.03426
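Of the metrics reported above, region similarity J is the Jaccard index (IoU) of predicted and ground-truth masks, and MOTA aggregates misses, false positives, and identity switches over ground-truth objects. A minimal numpy sketch of both (the boundary measure F requires contour matching and is omitted here):

```python
import numpy as np

def region_similarity(pred, gt):
    # J: intersection-over-union (Jaccard index) of two binary masks.
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = (pred | gt).sum()
    return (pred & gt).sum() / union if union else 1.0

def mota(num_misses, num_false_positives, num_switches, num_gt):
    # Multiple Object Tracking Accuracy:
    # 1 - (total errors / total ground-truth objects).
    return 1.0 - (num_misses + num_false_positives + num_switches) / num_gt
```

J&F is then simply the mean of the per-frame J and F scores.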
We explore the automatic detection of violin width reduction using 3D photogrammetric meshes. We compare SVM and Decision Tree classifiers applied to a geometry-based raw representation built from elevation maps against a more targeted, feature-engineered approach relying on parametric contour-line fitting. Although elevation maps occasionally achieve strong results, their performance does not surpass that of the contour-based inputs.
https://arxiv.org/abs/2604.02446
Accurate cardiac ultrasound segmentation is essential for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are challenging due to low contrast, speckle noise, irregular boundaries, and domain shifts across devices and patient populations. Existing methods, largely based on appearance-driven learning, often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework integrates multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial details while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps via cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to enforce structural consistency. The proposed method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions. These results highlight the effectiveness of integrating contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.
https://arxiv.org/abs/2603.28110
Precise segmentation of irregular and densely arranged components is essential for robotic disassembly and material recovery in electronic waste (e-waste) recycling. This study evaluates the impact of model architecture and scale on segmentation performance by comparing SAM2, a transformer-based vision model, with the lightweight YOLOv8 network. Both models were trained and tested on a newly collected dataset of 1,456 annotated RGB images of laptop components including logic boards, heat sinks, and fans, captured under varying illumination and orientation conditions. Data augmentation techniques, such as random rotation, flipping, and cropping, were applied to improve model robustness. YOLOv8 achieved higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and stronger boundary precision than SAM2 (mAP50 = 8.4%). SAM2 demonstrated flexibility in representing diverse object structures but often produced overlapping masks and inconsistent contours. These findings show that large pre-trained models require task-specific optimization for industrial applications. The resulting dataset and benchmarking framework provide a foundation for developing scalable vision algorithms for robotic e-waste disassembly and circular manufacturing systems.
https://arxiv.org/abs/2603.27441
Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
https://arxiv.org/abs/2603.27206
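The two-stage pseudo-label refinement described above (high-consistency seeds, then prototype-based relabeling of uncertain pixels) can be sketched in a few lines. The feature dimensions, thresholds, and blending weight below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def refine_pseudo_labels(feats, consistency, support_proto,
                         seed_thresh=0.8, alpha=0.5):
    """feats: (H, W, D) pixel features; consistency: (H, W) in [0, 1];
    support_proto: (D,) class prototype from support images."""
    # Stage 1: keep only high-consistency pixels as high-precision seeds.
    seeds = consistency >= seed_thresh
    # Image-adaptive prototype: blend global (support) and local (seed)
    # statistics, as the abstract describes.
    local_proto = feats[seeds].mean(axis=0)
    proto = alpha * support_proto + (1 - alpha) * local_proto
    # Stage 2: relabel remaining pixels by cosine similarity to the
    # blended prototype.
    norm = np.linalg.norm(feats, axis=-1) * np.linalg.norm(proto) + 1e-8
    sim = feats @ proto / norm
    return seeds | (sim >= 0.5)
```

In the real framework a constrained SAM-based update would then touch only boundary-band and unlabeled pixels; this sketch stops at the relabeling stage.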
Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
https://arxiv.org/abs/2603.25228
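The final contour registration above minimizes reprojection errors of unoccluded contour points. A minimal pinhole-camera sketch of that objective; the optimizer, occlusion reasoning, and multi-view weighting are omitted, and the function names are illustrative:

```python
import numpy as np

def project(points_3d, K, R, t):
    # Pinhole projection of 3D points into pixel coordinates.
    cam = points_3d @ R.T + t          # world -> camera frame
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def reprojection_error(points_3d, observed_2d, K, R, t):
    # Mean Euclidean distance between projected model contour points
    # and their observed image counterparts.
    return np.linalg.norm(project(points_3d, K, R, t) - observed_2d,
                          axis=1).mean()
```

A pose refinement step would perturb (R, t) to drive this error down; at the true pose the error is zero for noise-free observations.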
Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to represent the motion patterns of pedestrians. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving strong performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes, which inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed the Silhouette Language Model (SilLang), which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.
https://arxiv.org/abs/2603.23976
The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects, such as subtle surface deformations and irregular contours, that are difficult to detect with 2D-based methods. However, current multimodal industrial anomaly detection lacks effective use of crucial geometric information such as surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). First, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate a geometric prior. Second, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose geometric-prior-based attention fusion and anomaly-region segmentation, which enhance the model's ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms state-of-the-art (SOTA) methods in detection accuracy on both the MVTec-3D AD and Eyecandies datasets.
https://arxiv.org/abs/2603.22757
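Surface normal vectors, the kind of geometric prior the abstract emphasizes, can be estimated from a raw point cloud by local PCA: the normal at a point is the eigenvector of its neighborhood covariance with the smallest eigenvalue. A minimal numpy sketch of that standard construction (the neighborhood size k and the brute-force neighbor search are illustrative, and this is not the paper's differential computation):

```python
import numpy as np

def estimate_normals(points, k=8):
    # Normal at each point = eigenvector of the k-NN covariance matrix
    # associated with the smallest eigenvalue (local PCA).
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]  # brute-force neighbors (incl. self)
    normals = np.empty_like(points, dtype=float)
    for i, idx in enumerate(knn):
        nb = points[idx] - points[idx].mean(axis=0)
        # eigh returns eigenvalues in ascending order, so column 0 of
        # the eigenvectors is the flattest local direction.
        _, vecs = np.linalg.eigh(nb.T @ nb)
        normals[i] = vecs[:, 0]
    return normals
```

For points sampled on a plane, the estimated normals align with the plane's normal up to sign, which makes the routine easy to sanity-check.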