Image segmentation is a core task in image processing, yet many methods degrade when images are heavily corrupted by noise and exhibit intensity inhomogeneity. Within the iterative convolution-thresholding method (ICTM) framework, we propose a variational segmentation model that integrates denoising terms. Specifically, the denoising component consists of an I-divergence term and an adaptive total-variation (TV) regularizer, making the model well suited to images contaminated by Gamma-distributed multiplicative noise and Poisson noise. A spatially adaptive weight derived from a gray-level indicator guides diffusion differently across regions of varying intensity. To further address intensity inhomogeneity, we estimate a smoothly varying bias field, which improves segmentation accuracy. Regions are represented by characteristic functions, with contour length encoded accordingly. For efficient optimization, we couple ICTM with a relaxed modified scalar auxiliary variable (RMSAV) scheme. Extensive experiments on synthetic and real-world images with intensity inhomogeneity and diverse noise types show that the proposed model achieves superior accuracy and robustness compared with competing approaches.
https://arxiv.org/abs/2511.08988
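A plausible form of the denoising energy sketched in the abstract above, for a noisy image f on domain Omega (the I-divergence fidelity is the standard choice for Poisson and multiplicative noise; the specific adaptive weight below is our illustrative guess, not necessarily the paper's):

    E(u) = \int_\Omega \bigl( u - f \log u \bigr)\, dx + \int_\Omega \alpha(x)\, |\nabla u|\, dx,
    \qquad
    \alpha(x) = \lambda \left( \frac{(G_\sigma * f)(x)}{\max_y\, (G_\sigma * f)(y)} \right)^{p}

Here G_sigma is a Gaussian kernel, and the gray-level indicator (G_sigma * f) raises the TV weight in bright regions, where Gamma-type multiplicative noise is strongest, so diffusion smooths more aggressively there than in dark regions.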
Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM) with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of the 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
https://arxiv.org/abs/2511.08007
Although significant advances have been achieved in SAR land-cover classification, recent methods remain predominantly focused on supervised learning, which relies heavily on extensive labeled datasets. This dependency not only limits scalability and generalization but also restricts adaptability to diverse application scenarios. In this paper, a general-purpose foundation model for SAR land-cover classification is developed, serving as a robust cornerstone to accelerate the development and deployment of various downstream models. Specifically, a Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) pre-training framework is presented, which incorporates a Dynamic Instance (DI) module and a Contour Consistency (CC) module. The DI module enhances global contextual awareness by enforcing local consistency across different views of the same region. The CC module leverages shallow feature maps to guide the model to focus on the geometric contours of SAR land-cover objects, thereby improving structural discrimination. Additionally, to enhance robustness and generalization during pre-training, a large-scale and diverse dataset named SARSense, comprising 460,532 SAR images, is constructed to enable the model to capture comprehensive and representative features. To evaluate the generalization capability of our foundation model, we conducted extensive experiments across a variety of SAR land-cover classification tasks, including SAR land-cover mapping, water body detection, and road extraction. The results consistently demonstrate that the proposed DI3CL outperforms existing methods. Our code and pre-trained weights are publicly available at: this https URL.
https://arxiv.org/abs/2511.07808
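A minimal sketch of the view-consistency contrastive objective that the DI module alludes to, using a standard InfoNCE loss as a stand-in (the paper's exact loss, augmentations, and momentum details are not specified in the abstract):

import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """InfoNCE between two augmented views of the same SAR regions.
    z1, z2: (N, D) embeddings; row i of each view comes from the same region."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                            # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

# usage: embeddings from two random crops/augmentations of the same regions
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))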
Accurate fluence map prediction is essential in intensity-modulated radiation therapy (IMRT) to maximize tumor coverage while minimizing dose to healthy tissues. Conventional optimization is time-consuming and dependent on planner expertise. This study presents a deep learning framework that accelerates fluence map generation while maintaining clinical quality. An end-to-end 3D Swin-UNETR network was trained to predict nine-beam fluence maps directly from volumetric CT images and anatomical contours using 99 prostate IMRT cases (79 for training and 20 for testing). The transformer-based model employs hierarchical self-attention to capture both local anatomical structures and long-range spatial dependencies. Predicted fluence maps were imported into the Eclipse Treatment Planning System for dose recalculation, and model performance was evaluated using beam-wise fluence correlation, spatial gamma analysis, and dose-volume histogram (DVH) metrics. The proposed model achieved an average R^2 of 0.95 +/- 0.02, MAE of 0.035 +/- 0.008, and gamma passing rate of 85 +/- 10 percent (3 percent / 3 mm) on the test set, with no significant differences observed in DVH parameters between predicted and clinical plans. The Swin-UNETR framework enables fully automated, inverse-free fluence map prediction directly from anatomical inputs, enhancing spatial coherence, accuracy, and efficiency while offering a scalable and consistent solution for automated IMRT plan generation.
https://arxiv.org/abs/2511.08645
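For reference, the beam-wise fluence metrics quoted above can be computed as in the small numpy sketch below (the gamma analysis itself requires a dedicated dosimetry tool and is omitted here):

import numpy as np

def beamwise_metrics(pred, true):
    """pred, true: (n_beams, H, W) fluence maps. Returns per-beam R^2 and MAE."""
    r2, mae = [], []
    for p, t in zip(pred, true):
        ss_res = np.sum((p - t) ** 2)
        ss_tot = np.sum((t - t.mean()) ** 2)
        r2.append(1.0 - ss_res / ss_tot)
        mae.append(np.mean(np.abs(p - t)))
    return np.array(r2), np.array(mae)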
Recent advances in latent diffusion models have demonstrated state-of-the-art performance in high-dimensional time-series data synthesis while providing flexible control through conditioning and guidance. However, existing methodologies primarily rely on musical context or natural language as the main modality of interacting with the generative process, which may not be ideal for expert users who seek precise fader-like control over specific musical attributes. In this work, we explore the application of denoising diffusion processes as plug-and-play latent constraints for unconditional symbolic music generation models. We focus on a framework that leverages a library of small conditional diffusion models operating as implicit probabilistic priors on the latents of a frozen unconditional backbone. While previous studies have explored domain-specific use cases, this work, to the best of our knowledge, is the first to demonstrate the versatility of such an approach across a diverse array of musical attributes, such as note density, pitch range, contour, and rhythm complexity. Our experiments show that diffusion-driven constraints outperform traditional attribute regularization and other latent-constraint architectures, achieving significantly stronger correlations between target and generated attributes while maintaining high perceptual quality and diversity.
https://arxiv.org/abs/2511.07156
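The constraint mechanism above is diffusion-based; as a much simpler stand-in that conveys the plug-and-play idea, the sketch below nudges a frozen backbone's latent with gradients from a small attribute predictor (plain gradient guidance, not the paper's diffusion priors; all module names are hypothetical):

import torch

def guide_latent(z, attr_net, target, steps=50, lr=0.05):
    """Steer latent z so a frozen attribute predictor attr_net(z) matches target.
    The unconditional decoder itself is never touched."""
    z = z.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (attr_net(z) - target).pow(2).mean()  # e.g. note density, pitch range
        loss.backward()
        opt.step()
    return z.detach()

# hypothetical usage with a toy predictor on a 64-d latent
attr_net = torch.nn.Linear(64, 1)
z_new = guide_latent(torch.randn(1, 64), attr_net, torch.tensor([[0.8]]))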
The mean squared error (MSE) is a ubiquitous loss function for speech enhancement, but it fails to reflect auditory perceptual quality. MSE causes models to over-emphasize low-frequency components, which have high energy, leading to inadequate modeling of perceptually important high-frequency information. To overcome this limitation, we propose a perceptually weighted loss function grounded in psychoacoustic principles. Specifically, it leverages equal-loudness contours to assign frequency-dependent weights to the reconstruction error, thereby penalizing deviations in a way that aligns with human auditory sensitivity. The proposed loss is model-agnostic and flexible, demonstrating strong generality. Experiments on the VoiceBank+DEMAND dataset show that replacing MSE with our loss in a GTCRN model elevates the WB-PESQ score from 2.17 to 2.93, a significant improvement in perceptual quality.
https://arxiv.org/abs/2511.05945
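A minimal sketch of a frequency-weighted spectral loss in the spirit described above. We use the A-weighting curve as an easily computed stand-in for equal-loudness contours; the paper's exact weighting and STFT settings may differ:

import numpy as np

def a_weight(freqs_hz):
    """A-weighting gain (linear scale), a stand-in for equal-loudness weights."""
    f2 = np.asarray(freqs_hz, dtype=float) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return ra / ra.max()  # normalize so the most sensitive band has weight 1

def weighted_spectral_mse(S_est, S_ref, freqs_hz):
    """S_est, S_ref: (n_freq, n_frames) magnitude spectrograms."""
    w = a_weight(freqs_hz)[:, None]
    return float(np.mean(w * (S_est - S_ref) ** 2))

# usage with a 512-point STFT at 16 kHz: freqs = np.linspace(0, 8000, 257)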
End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains largely unsolved. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape method based on low-rank approximation for precise detection and a triple assignment detection head to enable fast inference. Specifically, unlike other shape representation methods that employ data-independent parameterizations, our data-driven approach derives a low-rank subspace directly from labeled text boundaries. To ensure this process is robust against the inherent annotation noise in this data, we utilize a specialized recovery method based on an $\ell_1$-norm formulation, which accurately reconstructs the text shape with only a few key orthogonal vectors. By exploiting the inherent shape correlation among different text contours, our method achieves consistency and compactness in shape representation. Next, the triple assignment scheme introduces a novel architecture where a deep sparse branch (for stabilized training) is used to guide the learning of an ultra-lightweight sparse branch (for accelerated inference), while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on several challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods. Code will be available at: this https URL
https://arxiv.org/abs/2511.05818
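A toy sketch of the data-driven low-rank idea above: resample labeled boundaries to a fixed number of points, learn an orthogonal basis by SVD, and represent any contour with a few coefficients. Plain least-squares SVD is used here; the paper's $\ell_1$-norm robust recovery is more involved:

import numpy as np

def contour_basis(contours, k=8):
    """contours: (n_samples, 2*m) flattened (x, y) boundary points, each resampled
    to m points. Returns the mean shape and the top-k orthogonal basis vectors."""
    X = np.asarray(contours, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]               # shapes: (2m,), (k, 2m)

def encode(contour, mean, basis):
    return basis @ (contour - mean)   # k shape coefficients

def decode(coeffs, mean, basis):
    return mean + basis.T @ coeffs    # reconstructed boundary points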
This paper presents Walk the Lines 2 (WtL2), a unique contour tracking algorithm specifically adapted for detailed segmentation of infrared (IR) ships and various objects in RGB. It extends the original Walk the Lines (WtL) [12], which focused solely on detailed ship segmentation in color. Both WtL variants can replace standard non-maximum suppression (NMS) by using contour tracking to refine the object contour until a 1-pixel-wide closed shape can be binarized, forming a segmentable area in foreground-background scenarios. WtL2 broadens the application range of WtL beyond its original scope, adapting to IR and expanding to diverse objects within the RGB context. To achieve IR segmentation, we adapt its input, the object contour detector, to IR ships. In addition, the algorithm is enhanced to process a wide range of RGB objects, outperforming the latest generation of contour-based methods when it achieves a closed object contour, offering high peak Intersection over Union (IoU) with impressive detail. This positions WtL2 as a compelling method for specialized applications that require detailed segmentation or high-quality samples, potentially accelerating progress in several niche areas of image segmentation.
https://arxiv.org/abs/2511.05210
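The binarization step mentioned above (a closed 1-pixel-wide contour becoming a segmentable area) amounts to a hole fill; a minimal sketch with scipy:

import numpy as np
from scipy.ndimage import binary_fill_holes

def contour_to_mask(contour_mask):
    """contour_mask: boolean image with a 1-pixel-wide closed object contour.
    Filling the enclosed region yields the foreground segment."""
    return binary_fill_holes(contour_mask)

# toy example: a closed square contour whose interior becomes True
c = np.zeros((7, 7), dtype=bool)
c[1, 1:6] = c[5, 1:6] = c[1:6, 1] = c[1:6, 5] = True
mask = contour_to_mask(c)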
Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing iso-surface-based representations rely heavily on water-tightening or rendering optimization, which inevitably compromises geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the iso-surface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
https://arxiv.org/abs/2511.04029
Faithful yet compact explanations for vision models remain a challenge, as commonly used dense perturbation masks are often fragmented and overfitted, needing careful post-processing. Here, we present a training-free explanation method that replaces dense masks with smooth tunable contours. A star-convex region is parameterized by a truncated Fourier series and optimized under an extremal preserve/delete objective using the classifier's gradients. The approach guarantees a single, simply connected mask, cuts the number of free parameters by orders of magnitude, and yields stable boundary updates without cleanup. Restricting solutions to low-dimensional, smooth contours makes the method robust to adversarial masking artifacts. On ImageNet classifiers, it matches the extremal fidelity of dense masks while producing compact, interpretable regions with improved run-to-run consistency. Explicit area control also enables importance contour maps, yielding transparent fidelity-area profiles. Finally, we extend the approach to multiple contours and show how it can localize multiple objects within the same framework. Across benchmarks, the method achieves higher relevance mass and lower complexity than gradient- and perturbation-based baselines, with especially strong gains on self-supervised DINO models, where it improves relevance mass by over 15% and maintains positive faithfulness correlations.
https://arxiv.org/abs/2511.01411
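A sketch of the star-convex parameterization above: the radius around a center is a truncated Fourier series r(theta) = a0 + sum_k (a_k cos k·theta + b_k sin k·theta), and a pixel lies in the mask iff its distance to the center is at most r at its angle (the optimization of the coefficients is omitted):

import numpy as np

def fourier_contour_mask(center, coeffs, shape):
    """Rasterize a star-convex region with Fourier-parameterized radius.
    coeffs: (a0, a1, b1, ..., aK, bK)."""
    a0, rest = coeffs[0], coeffs[1:]
    K = len(rest) // 2
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    dy, dx = yy - center[0], xx - center[1]
    theta = np.arctan2(dy, dx)
    r = np.full(shape, a0, dtype=float)
    for k in range(1, K + 1):
        r += rest[2*k - 2] * np.cos(k * theta) + rest[2*k - 1] * np.sin(k * theta)
    return dx**2 + dy**2 <= np.maximum(r, 0.0)**2  # inside iff distance <= r(theta)

# toy usage: a wobbly disc of base radius 20 in a 128x128 image
mask = fourier_contour_mask((64, 64), [20.0, 3.0, 0.0, 0.0, 2.0], (128, 128))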
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
https://arxiv.org/abs/2510.27680
Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing the Dice Similarity Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using either technique individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN)-based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets containing both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and a clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
https://arxiv.org/abs/2510.27315
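A sketch of the two preprocessing channels named above, with typical parameter values (ours, not necessarily the paper's):

import cv2
import numpy as np

def preprocess_channels(gray):
    """gray: uint8 single-channel angiogram. Returns stacked raw, CLAHE,
    and Ben Graham style channels as multichannel network input."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)
    # Ben Graham trick: subtract a heavy Gaussian blur to flatten illumination
    ben = cv2.addWeighted(gray, 4, cv2.GaussianBlur(gray, (0, 0), 10), -4, 128)
    return np.stack([gray, clahe, ben], axis=-1)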
Boundary Vector Cells (BVCs) are a class of neurons in the brains of vertebrates that encode environmental boundaries at specific distances and allocentric directions, playing a central role in forming place fields in the hippocampus. Most computational BVC models are restricted to two-dimensional (2D) environments, making them prone to spatial ambiguities in the presence of horizontal symmetries in the environment. To address this limitation, we incorporate vertical angular sensitivity into the BVC framework, thereby enabling robust boundary detection in three dimensions, and leading to significantly more accurate spatial localization in a biologically-inspired robot model. The proposed model processes LiDAR data to capture vertical contours, thereby disambiguating locations that would be indistinguishable under a purely 2D representation. Experimental results show that in environments with minimal vertical variation, the proposed 3D model matches the performance of a 2D baseline; yet, as 3D complexity increases, it yields substantially more distinct place fields and markedly reduces spatial aliasing. These findings show that adding a vertical dimension to BVC-based localization can significantly enhance navigation and mapping in real-world 3D spaces while retaining performance parity in simpler, near-planar scenarios.
https://arxiv.org/abs/2510.24029
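A sketch of a 3D-extended BVC tuning curve in the spirit of the model above: the classic distance/allocentric-direction Gaussian product gains a third factor for vertical angle (parameter values are illustrative, and real BVC models typically widen distance tuning with preferred distance):

import numpy as np

def bvc_response(d, phi, theta, d_pref=1.0, phi_pref=0.0, theta_pref=0.0,
                 sd_d=0.3, sd_phi=0.2, sd_theta=0.2):
    """Firing of one BVC for a boundary point at distance d (m), allocentric
    horizontal direction phi (rad), and vertical angle theta (rad)."""
    g_d = np.exp(-(d - d_pref) ** 2 / (2 * sd_d ** 2))
    dphi = np.angle(np.exp(1j * (phi - phi_pref)))   # wrap difference to [-pi, pi]
    g_phi = np.exp(-dphi ** 2 / (2 * sd_phi ** 2))
    g_theta = np.exp(-(theta - theta_pref) ** 2 / (2 * sd_theta ** 2))
    return g_d * g_phi * g_theta

# population response: sum bvc_response over all LiDAR returns (d_i, phi_i, theta_i)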
Computer vision can accelerate ecological research and conservation monitoring, yet adoption in ecology lags in part because of a lack of trust in black-box neural-network-based models. We seek to address this challenge by applying post-hoc explanations to provide evidence for predictions and document limitations that are important to field deployment. Using aerial imagery from Glacier Bay National Park, we train a Faster R-CNN to detect pinnipeds (harbor seals) and generate explanations via gradient-based class activation mapping (HiResCAM, LayerCAM), local interpretable model-agnostic explanations (LIME), and perturbation-based explanations. We assess explanations along three axes relevant to field use: (i) localization fidelity: whether high-attribution regions coincide with the animal rather than background context; (ii) faithfulness: whether deletion/insertion tests produce changes in detector confidence; and (iii) diagnostic utility: whether explanations reveal systematic failure modes. Explanations concentrate on seal torsos and contours rather than surrounding ice/rock, and removal of the seals reduces detection confidence, providing model-based evidence for true positives. The analysis also uncovers recurrent error sources, including confusion of seals with black ice and rocks. We translate these findings into actionable next steps for model development, including more targeted data curation and augmentation. By pairing object detection with post-hoc explainability, we can move beyond "black-box" predictions toward auditable, decision-supporting tools for conservation monitoring.
https://arxiv.org/abs/2510.21689
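A sketch of the deletion-style faithfulness check described in (ii): remove the most-attributed pixels and measure the drop in detector confidence. The detector and attribution map are passed in; names are generic:

import numpy as np

def deletion_drop(image, attribution, score_fn, frac=0.1, fill=0.0):
    """score_fn(image) -> confidence of the detection being explained.
    Deletes the top `frac` most-attributed pixels and reports the drop.
    attribution: (H, W) map; image may be (H, W) or (H, W, C)."""
    k = int(frac * attribution.size)
    thresh = np.partition(attribution.ravel(), -k)[-k]
    perturbed = image.copy()
    perturbed[attribution >= thresh] = fill       # e.g. grey out the seal's torso
    return score_fn(image) - score_fn(perturbed)  # large drop => faithful explanation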
Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model's over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.
https://arxiv.org/abs/2510.19487
Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells (WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there has been little effort to build tools that automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within them. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell-cluster and non-cluster groups by fine-tuning the You Only Look Once (YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.
https://arxiv.org/abs/2510.17716
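A sketch of the second step above: overlay a cluster contour on each fluorescence channel and call a marker present when its mean in-contour intensity clears a threshold (channel names and thresholds are illustrative):

import cv2

def cell_types_in_cluster(cluster_mask, channels, names, thresholds):
    """cluster_mask: uint8 binary mask of one cluster from the bright-field image.
    channels: dict name -> fluorescence image. Returns markers deemed present."""
    present = []
    for name in names:
        mean_int = cv2.mean(channels[name], mask=cluster_mask)[0]
        if mean_int > thresholds[name]:
            present.append(name)   # e.g. a high CD45 channel -> WBC in the cluster
    return present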
Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.
https://arxiv.org/abs/2510.17039
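For reference, the overlap metrics quoted above (Dice = 0.83, IoU = 0.71) are computed as in this small numpy sketch:

import numpy as np

def dice_iou(pred, true):
    """pred, true: boolean segmentation masks. Returns (Dice, IoU)."""
    inter = np.logical_and(pred, true).sum()
    dice = 2.0 * inter / (pred.sum() + true.sum())
    iou = inter / np.logical_or(pred, true).sum()
    return dice, iou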
Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated on the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficients of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly detection rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
https://arxiv.org/abs/2510.15684
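A minimal sketch of the reconstruction-based localization above: reconstruct the scan with an autoencoder trained on healthy brains only, then threshold the error map (the percentile threshold is our illustrative choice; the paper additionally refines contours with SAM):

import numpy as np

def anomaly_mask(scan, autoencoder, pct=99.0):
    """scan: (C, D, H, W) multimodal MRI volume; autoencoder is any callable that
    returns a reconstruction of the same shape, trained on healthy anatomy."""
    recon = autoencoder(scan)
    err = np.abs(scan - recon).mean(axis=0)   # fuse modalities into one error map
    return err >= np.percentile(err, pct)     # high error flags candidate tumor voxels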
Effective assessment of mobile network coverage and the precise identification of service weak spots are paramount for network operators striving to enhance user Quality of Experience (QoE). This paper presents a novel framework for mobile coverage and weak spot analysis utilising crowdsourced QoE data. The core of our methodology involves coverage analysis at the individual cell (antenna) level, subsequently aggregated to the site level, using empirical geolocation data. A key contribution of this research is the application of the One-Class Support Vector Machine (OC-SVM) algorithm for calculating mobile network coverage. This approach models the decision hyperplane as the effective coverage contour, facilitating robust calculation of coverage areas for individual cells and entire sites. The same methodology is extended to analyse crowdsourced service loss reports, thereby identifying and quantifying geographically localised weak spots. Our findings demonstrate the efficacy of this novel framework in accurately mapping mobile coverage and, crucially, in highlighting granular areas of signal deficiency, particularly within complex urban environments.
https://arxiv.org/abs/2510.13459
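A sketch of the coverage estimation above with scikit-learn: fit an OC-SVM to a cell's measurement geolocations and take the zero level set of the decision function as the coverage contour (kernel and nu/gamma values are illustrative):

import numpy as np
from sklearn.svm import OneClassSVM

def coverage_region(points, grid_x, grid_y, nu=0.05, gamma=0.5):
    """points: (n, 2) projected geolocations of QoE samples served by one cell.
    Returns a boolean grid: True where the cell is deemed covered."""
    svm = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(points)
    xx, yy = np.meshgrid(grid_x, grid_y)
    scores = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    return (scores >= 0).reshape(xx.shape)   # decision hyperplane = coverage contour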
Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods rely on either explicit degradation priors or global statistical guidance, which struggle with localized artifacts and face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: this https URL.
https://arxiv.org/abs/2510.12114