Segmentation

RSFR: A Coarse-to-Fine Reconstruction Framework for Diffusion Tensor Cardiac MRI with Semantic-Aware Refinement

2025-04-25 17:41:14

Jiahao Huang, Fanwen Wang, Pedro F. Ferreira, Haosen Zhang, Yinzhe Wu, Zhifan Gao, Lei Zhu, Angelica I. Aviles-Rivero, Carola-Bibiane Schonlieb, Andrew D. Scott, Zohya Khalique, Maria Dwornik, Ramyah Rajakulasingam, Ranil De Silva, Dudley J. Pennell, Guang Yang, Sonia Nielles-Vallespin

arXiv_CV

arXiv_CV Segmentation Quantitative Zero-Shot Reconstruction Diffusion
Abstract

Cardiac diffusion tensor imaging (DTI) offers unique insights into cardiomyocyte arrangements, bridging the gap between microscopic and macroscopic cardiac function. However, its clinical utility is limited by technical challenges, including a low signal-to-noise ratio, aliasing artefacts, and the need for accurate quantitative fidelity. To address these limitations, we introduce RSFR (Reconstruction, Segmentation, Fusion & Refinement), a novel framework for cardiac diffusion-weighted image reconstruction. RSFR employs a coarse-to-fine strategy, leveraging zero-shot semantic priors via the Segment Anything Model and a robust Vision Mamba-based reconstruction backbone. Our framework integrates semantic features effectively to mitigate artefacts and enhance fidelity, achieving state-of-the-art reconstruction quality and accurate DT parameter estimation under high undersampling rates. Extensive experiments and ablation studies demonstrate the superior performance of RSFR compared to existing methods, highlighting its robustness, scalability, and potential for clinical translation in quantitative cardiac DTI.

URL

https://arxiv.org/abs/2504.18520

PDF

https://arxiv.org/pdf/2504.18520.pdf
Iterative Event-based Motion Segmentation by Variational Contrast Maximization

2025-04-25 16:00:23

Ryo Yamaki, Shintaro Shiba, Guillermo Gallego, Yoshimitsu Aoki

arXiv_AI

arXiv_AI Segmentation Detection Object_Detection Pose
Abstract

Event cameras provide rich signals that are suitable for motion estimation since they respond to changes in the scene. As any visual changes in the scene produce event data, it is paramount to classify the data into different motions (i.e., motion segmentation), which is useful for various tasks such as object detection and visual servoing. We propose an iterative motion segmentation method, by classifying events into background (e.g., dominant motion hypothesis) and foreground (independent motion residuals), thus extending the Contrast Maximization framework. Experimental results demonstrate that the proposed method successfully classifies event clusters both for public and self-recorded datasets, producing sharp, motion-compensated edge-like images. The proposed method achieves state-of-the-art accuracy on moving object detection benchmarks with an improvement of over 30%, and demonstrates its applicability to more complex and noisy real-world scenes. We hope this work broadens the sensitivity of Contrast Maximization with respect to both motion parameters and input events, thus contributing to theoretical advancements in event-based motion segmentation estimation.
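
The Contrast Maximization objective that this abstract builds on can be illustrated with a minimal sketch. It is not the authors' iterative background/foreground method: it only shows the basic idea of warping events along a candidate motion and scoring the sharpness (variance) of the resulting image. The function names, the constant-velocity motion model, and the synthetic events are assumptions made for illustration.

```python
def contrast(events, velocity, width, height, t_ref=0.0):
    """Variance of the image of warped events for one candidate velocity."""
    vx, vy = velocity
    image = [[0.0] * width for _ in range(height)]
    for x, y, t in events:
        # Warp each event back to the reference time along the candidate motion.
        wx = int(round(x - vx * (t - t_ref)))
        wy = int(round(y - vy * (t - t_ref)))
        if 0 <= wx < width and 0 <= wy < height:
            image[wy][wx] += 1.0
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def best_velocity(events, candidates, width, height):
    """Pick the candidate velocity whose warped image has the highest contrast."""
    return max(candidates, key=lambda v: contrast(events, v, width, height))


# Hypothetical data: a vertical edge moving at 2 pixels per time unit along x.
events = [(2 + 2 * t, y, float(t)) for t in range(4) for y in range(5)]
candidates = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
estimate = best_velocity(events, candidates, width=16, height=5)
```

With the correct velocity, all events from the moving edge collapse onto one image column, which is what makes the warped image sharp. Events that stay blurred under the dominant motion are the natural candidates for an independently moving foreground.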

URL

https://arxiv.org/abs/2504.18447

PDF

https://arxiv.org/pdf/2504.18447.pdf
Nearly isotropic segmentation for medial temporal lobe subregions in multi-modality MRI

2025-04-25 15:54:03

Yue Li, Pulkit Khandelwal, Long Xie, Laura E. M. Wisse, Nidhi Mundada, Christopher A. Brown, Emily McGrew, Amanda Denning, Sandhitsu R. Das, David A. Wolk, Paul A. Yushkevich

arXiv_CV

arXiv_CV Segmentation Deep_Learning
Abstract

Morphometry of medial temporal lobe (MTL) subregions in brain MRI is a sensitive biomarker of Alzheimer's disease and other related conditions. While T2-weighted (T2w) MRI with high in-plane resolution is widely used to segment hippocampal subfields due to its higher contrast in the hippocampus, its lower out-of-plane resolution reduces the accuracy of subregion thickness measurements. To address this issue, we developed a nearly isotropic segmentation pipeline that incorporates image and label upsampling and high-resolution segmentation in T2w MRI. First, a high-resolution atlas was created based on an existing anisotropic atlas derived from 29 individuals. Both T1-weighted and T2w images in the atlas were upsampled from their original resolution to a nearly isotropic resolution of 0.4 x 0.4 x 0.52 mm^3 using a non-local means approach. Manual segmentations within the atlas were also upsampled to match this resolution using a UNet-based neural network, which was trained on a cohort consisting of both high-resolution ex vivo and low-resolution anisotropic in vivo MRI with manual segmentations. Second, a multi-modality deep learning-based segmentation model was trained within this nearly isotropic atlas. Finally, experiments showed that the nearly isotropic subregion segmentation improved the accuracy of cortical thickness as an imaging biomarker for neurodegeneration in T2w MRI.

URL

https://arxiv.org/abs/2504.18442

PDF

https://arxiv.org/pdf/2504.18442.pdf
NUDF: Neural Unsigned Distance Fields for high resolution 3D medical image segmentation

2025-04-25 13:32:16

Kristine Sørensen, Oscar Camara, Ole de Backer, Klaus Kofoed, Rasmus Paulsen

arXiv_CV

arXiv_CV Segmentation Face Pose Medical 3D
Abstract

Medical image segmentation is often considered as the task of labelling each pixel or voxel as being inside or outside a given anatomy. Processing the images at their original size and resolution often results in insuperable memory requirements, but downsampling the images leads to a loss of important details. Instead of aiming to represent a smooth and continuous surface in a binary voxel-grid, we propose to learn a Neural Unsigned Distance Field (NUDF) directly from the image. The small memory requirements of NUDF allow for high resolution processing, while the continuous nature of the distance field allows us to create high resolution 3D mesh models of shapes of any topology (e.g., open surfaces). We evaluate our method on the task of left atrial appendage (LAA) segmentation from Computed Tomography (CT) images. The LAA is a complex and highly variable shape and is thus difficult to represent with traditional segmentation methods using discrete labelmaps. With our proposed method, we are able to predict 3D mesh models that capture the details of the LAA and achieve accuracy on the order of the voxel spacing in the CT images.
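
To make the quantity behind an unsigned distance field concrete, here is a minimal sketch that computes the unsigned distance from a query point to a sampled surface. It is only an illustration of the target value: NUDF, as described above, learns such a field with a neural network directly from the image, which this sketch does not attempt. The flat patch and the function name are hypothetical.

```python
import math


def unsigned_distance(query, surface_points):
    """Distance from a query point to the nearest sampled surface point."""
    return min(math.dist(query, p) for p in surface_points)


# Hypothetical open surface: a flat square patch in the plane z = 0.
patch = [(x / 10, y / 10, 0.0) for x in range(11) for y in range(11)]

above = unsigned_distance((0.5, 0.5, 0.3), patch)
below = unsigned_distance((0.5, 0.5, -0.3), patch)
```

Both query points get the same value because the field carries no inside/outside sign. That is why an unsigned field can describe an open surface, for which "inside" is undefined, whereas a binary labelmap or a signed distance field cannot.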

URL

https://arxiv.org/abs/2504.18344

PDF

https://arxiv.org/pdf/2504.18344.pdf
A Data-Centric Approach to 3D Semantic Segmentation of Railway Scenes

2025-04-25 09:46:31

Nicolas Münger, Max Peter Ronecker, Xavier Diaz, Michael Karner, Daniel Watzenig, Jan Skaloud

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Prediction Autonomous 3D
Abstract

LiDAR-based semantic segmentation is critical for autonomous trains, requiring accurate predictions across varying distances. This paper introduces two targeted data augmentation methods designed to improve segmentation performance on the railway-specific OSDaR23 dataset. The person instance pasting method enhances segmentation of pedestrians at distant ranges by injecting realistic variations into the dataset. The track sparsification method redistributes point density in LiDAR scans, improving track segmentation at far distances with minimal impact on close-range accuracy. Both methods are evaluated using a state-of-the-art 3D semantic segmentation network, demonstrating significant improvements in distant-range performance while maintaining robustness in close-range predictions. We establish the first 3D semantic segmentation benchmark for OSDaR23, demonstrating the potential of data-centric approaches to address railway-specific challenges in autonomous train perception.

URL

https://arxiv.org/abs/2504.18213

PDF

https://arxiv.org/pdf/2504.18213.pdf
Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition

2025-04-25 09:31:03

Yin Tang, Jiankai Li, Hongyu Yang, Xuan Dong, Lifeng Fan, Weixin Li

arXiv_AI

arXiv_AI Segmentation Semantic_Segmentation CNN Recognition Detection Object_Detection Classification Embedding Relation Knowledge Pose
Abstract

In an era where social media platforms abound, individuals frequently share images that offer insights into their intents and interests, impacting individual life quality and societal stability. Traditional computer vision tasks, such as object detection and semantic segmentation, focus on concrete visual representations, while intent recognition relies more on implicit visual clues. This poses challenges due to the wide variation and subjectivity of such clues, compounded by the problem of intra-class variety in conveying abstract concepts, e.g. "enjoy life". Existing methods seek to solve the problem by manually designing representative features or building prototypes for each class from global features. However, these methods still struggle to deal with the large visual diversity of each intent category. In this paper, we introduce a novel approach named Multi-grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking down intent recognition into visual clue composition and integrating multi-grained features. We adopt class-specific prototypes to alleviate data imbalance. We treat intent recognition as a multi-label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations. Achieving state-of-the-art performance on the Intentonomy and MDID datasets, our approach improves on the accuracy of existing methods while also offering good interpretability. Our work offers a starting point for future exploration of complex and miscellaneous forms of human expression.

URL

https://arxiv.org/abs/2504.18201

PDF

https://arxiv.org/pdf/2504.18201.pdf
What is the Added Value of UDA in the VFM Era?

2025-04-25 09:10:10

Brunó B. Englert, Tommie Kerssies, Gijs Dubbelman

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Transformer Unsupervised Autonomous
Abstract

Unsupervised Domain Adaptation (UDA) can improve a perception model's generalization to an unlabeled target domain starting from a labeled source domain. UDA using Vision Foundation Models (VFMs) with synthetic source data can achieve generalization performance comparable to fully-supervised learning with real target data. However, because VFMs have strong generalization from their pre-training, more straightforward, source-only fine-tuning can also perform well on the target. As data scenarios used in academic research are not necessarily representative for real-world applications, it is currently unclear (a) how UDA behaves with more representative and diverse data and (b) if source-only fine-tuning of VFMs can perform equally well in these scenarios. Our research aims to close these gaps and, similar to previous studies, we focus on semantic segmentation as a representative perception task. We assess UDA for synth-to-real and real-to-real use cases with different source and target data combinations. We also investigate the effect of using a small amount of labeled target data in UDA. We clarify that while these scenarios are more realistic, they are not necessarily more challenging. Our results show that, when using stronger synthetic source data, UDA's improvement over source-only fine-tuning of VFMs reduces from +8 mIoU to +2 mIoU, and when using more diverse real source data, UDA has no added value. However, UDA generalization is always higher in all synthetic data scenarios than source-only fine-tuning and, when including only 1/16 of Cityscapes labels, synthetic UDA obtains the same state-of-the-art segmentation quality of 85 mIoU as a fully-supervised model using all labels. Considering the mixed results, we discuss how UDA can best support robust autonomous driving at scale.

URL

https://arxiv.org/abs/2504.18190

PDF

https://arxiv.org/pdf/2504.18190.pdf
E-InMeMo: Enhanced Prompting for Visual In-Context Learning

2025-04-25 08:12:58

Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara

arXiv_CV

arXiv_CV Segmentation Detection Object_Detection Pose
Abstract

Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: this https URL

URL

https://arxiv.org/abs/2504.18158

PDF

https://arxiv.org/pdf/2504.18158.pdf
A Large Vision-Language Model based Environment Perception System for Visually Impaired People

2025-04-25 02:46:22

Zezhou Chen, Zhaoxiang Liu, Kai Wang, Kohou Wang, Shiguo Lian

arXiv_AI

arXiv_AI Segmentation Face QA Knowledge Language_Model Transformer Pose Chat
Abstract

It is a challenging task for visually impaired people to perceive their surrounding environment due to the complexity of natural scenes. Their personal and social activities are thus highly limited. This paper introduces a Large Vision-Language Model (LVLM) based environment perception system which helps them to better understand the surrounding environment, by capturing the current scene they face with a wearable device, and then letting them retrieve the analysis results through the device. Visually impaired people can acquire a global description of the scene by long pressing the screen to activate the LVLM output, retrieve the categories of the objects in the scene resulting from a segmentation model by tapping or swiping the screen, and get a detailed description of the objects they are interested in by double-tapping the screen. To help visually impaired people more accurately perceive the world, this paper proposes incorporating the segmentation result of the RGB image as external knowledge into the input of the LVLM to reduce the LVLM's hallucination. Technical experiments on POPE, MME and LLaVA-QA90 show that the system can provide a more accurate description of the scene compared to Qwen-VL-Chat, and exploratory experiments show that the system helps visually impaired people to perceive the surrounding environment effectively.

URL

https://arxiv.org/abs/2504.18027

PDF

https://arxiv.org/pdf/2504.18027.pdf
Federated Client-tailored Adapter for Medical Image Segmentation

2025-04-25 02:20:25

Guyue Hu, Siyuan Song, Yukun Kang, Zhu Yin, Gangming Zhao, Chenglong Li, Jin Tang

arXiv_CV

arXiv_CV Segmentation Knowledge Pose Medical
Abstract

Medical image segmentation in X-ray images is beneficial for computer-aided diagnosis and lesion localization. Existing methods mainly fall into a centralized learning paradigm, which is inapplicable in the practical medical scenario that only has access to distributed data islands. Federated Learning has the potential to offer a distributed solution but struggles with heavy training instability due to client-wise domain heterogeneity (including distribution diversity and class imbalance). In this paper, we propose a novel Federated Client-tailored Adapter (FCA) framework for medical image segmentation, which achieves stable and client-tailored adaptive segmentation without sharing sensitive local data. Specifically, the federated adapter stirs universal knowledge in off-the-shelf medical foundation models to stabilize the federated training process. In addition, we develop two client-tailored federated updating strategies that adaptively decompose the adapter into common and individual components, then globally and independently update the parameter groups associated with common client-invariant and individual client-specific units, respectively. They further stabilize the heterogeneous federated learning process and realize optimal client-tailored instead of sub-optimal global-compromised segmentation models. Extensive experiments on three large-scale datasets demonstrate the effectiveness and superiority of the proposed FCA framework for federated medical segmentation.
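
The split between globally and independently updated parameter groups can be sketched as follows. This is a simplified illustration under stated assumptions, not the FCA update rule: the abstract describes an adaptive decomposition, whereas here the split is fixed in advance by hypothetical parameter names, parameters are plain floats, and the server uses unweighted averaging.

```python
def federated_round(client_adapters, common_keys):
    """Average the common parameters across clients; leave the rest local."""
    n = len(client_adapters)
    averaged = {
        key: sum(adapter[key] for adapter in client_adapters) / n
        for key in common_keys
    }
    # Each client keeps its individual parameters and receives the shared ones.
    return [{**adapter, **averaged} for adapter in client_adapters]


# Hypothetical adapters for two clients (e.g. two hospitals).
clients = [
    {"common.w": 1.0, "individual.b": 10.0},
    {"common.w": 3.0, "individual.b": 20.0},
]
updated = federated_round(clients, common_keys=["common.w"])
```

Only the averaged values of the common group would need to leave a client in this scheme. The client-specific group is never aggregated, so it can fit the local data distribution without being pulled toward a global compromise.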

URL

https://arxiv.org/abs/2504.18020

PDF

https://arxiv.org/pdf/2504.18020.pdf
Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

2025-04-25 00:43:20

Yuanbing Ouyang, Yizhuo Liang, Qingpeng Li, Xinfei Guo, Yiming Luo, Di Wu, Hao Wang, Yushan Pan

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Transformer
Abstract

Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data characteristics. This study introduces 'LVTP', a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features, with two rounds of clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. A novel dynamic scoring mechanism using multi-scale Tsallis entropy weighting overcomes limitations of traditional single-parameter entropy. The framework also incorporates low-level feature analysis to preserve critical edge information while optimizing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%-45% computational reductions with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
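
For reference, Tsallis entropy with parameter q is S_q(p) = (1 - sum_i p_i^q) / (q - 1), and it approaches Shannon entropy as q tends to 1. The sketch below computes it and combines several q values into one score. The combination is an assumption for illustration: the abstract does not specify how LVTP weights its scales, so the `qs` and `weights` arguments are hypothetical.

```python
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q of a discrete distribution."""
    if q == 1.0:
        # The q -> 1 limit is the Shannon entropy (natural log).
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


def weighted_entropy_score(probs, qs, weights):
    """Hypothetical token score: weighted sum of S_q over several q values."""
    return sum(w * tsallis_entropy(probs, q) for q, w in zip(qs, weights))


# Hypothetical attention distributions for two tokens.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

score_uniform = weighted_entropy_score(uniform, qs=[0.5, 2.0], weights=[0.5, 0.5])
score_peaked = weighted_entropy_score(peaked, qs=[0.5, 2.0], weights=[0.5, 0.5])
```

Values of q below 1 give more weight to rare outcomes and values above 1 to dominant ones, which is one plausible reason to use more than a single entropy parameter when scoring tokens.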

URL

https://arxiv.org/abs/2504.17996

PDF

https://arxiv.org/pdf/2504.17996.pdf
Virtual Roads, Smarter Safety: A Digital Twin Framework for Mixed Autonomous Traffic Safety Analysis

2025-04-24 22:27:59

Hao Zhang, Ximin Yue, Kexin Tian, Sixu Li, Keshu Wu, Zihao Li, Dominique Lord, Yang Zhou

arXiv_RO

arXiv_RO Segmentation Semantic_Segmentation Drone Pose Autonomous Action 3D
Abstract

This paper presents a digital-twin platform for active safety analysis in mixed traffic environments. The platform is built using a multi-modal data-enabled traffic environment constructed from drone-based aerial LiDAR, OpenStreetMap, and vehicle sensor data (e.g., GPS and inclinometer readings). High-resolution 3D road geometries are generated through AI-powered semantic segmentation and georeferencing of aerial LiDAR data. To simulate real-world driving scenarios, the platform integrates the CAR Learning to Act (CARLA) simulator, Simulation of Urban MObility (SUMO) traffic model, and NVIDIA PhysX vehicle dynamics engine. CARLA provides detailed micro-level sensor and perception data, while SUMO manages macro-level traffic flow. NVIDIA PhysX enables accurate modeling of vehicle behaviors under diverse conditions, accounting for mass distribution, tire friction, and center of mass. This integrated system supports high-fidelity simulations that capture the complex interactions between autonomous and conventional vehicles. Experimental results demonstrate the platform's ability to reproduce realistic vehicle dynamics and traffic scenarios, enhancing the analysis of active safety measures. Overall, the proposed framework advances traffic safety research by enabling in-depth, physics-informed evaluation of vehicle behavior in dynamic and heterogeneous traffic environments.

URL

https://arxiv.org/abs/2504.17968

PDF

https://arxiv.org/pdf/2504.17968.pdf
Predicting Dairy Calf Body Weight from Depth Images Using Deep Learning and Threshold Segmentation with Cross-Validation and Longitudinal Analysis

2025-04-24 21:08:31

Mingsi Liao, Gota Morota, Ye Bi, Rebecca R. Cockrum

arXiv_CV

arXiv_CV Segmentation Deep_Learning Prediction
Abstract

Monitoring calf body weight (BW) before weaning is essential for assessing growth, feed efficiency, health, and weaning readiness. However, labor, time, and facility constraints limit BW collection. Additionally, Holstein calf coat patterns complicate image-based BW estimation, and few studies have explored non-contact measurements taken at early time points for predicting later BW. The objectives of this study were to (1) develop deep learning-based segmentation models for extracting calf body metrics, (2) compare deep learning segmentation with threshold-based methods, and (3) evaluate BW prediction using single-time-point cross-validation with linear regression (LR) and extreme gradient boosting (XGBoost) and multiple-time-point cross-validation with LR, XGBoost, and a linear mixed model (LMM). Depth images from Holstein (n = 63) and Jersey (n = 5) pre-weaning calves were collected, with 20 Holstein calves being weighed manually. Results showed that You Only Look Once version 8 (YOLOv8) deep learning segmentation (intersection over union = 0.98) outperformed threshold-based methods (0.89). In single-time-point cross-validation, XGBoost achieved the best BW prediction (R^2 = 0.91, mean absolute percentage error (MAPE) = 4.37%), while LMM provided the most accurate longitudinal BW prediction (R^2 = 0.99, MAPE = 2.39%). These findings highlight the potential of deep learning for automated BW prediction, enhancing farm management.
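
The two regression metrics reported above, R^2 and mean absolute percentage error (MAPE), can be computed as in the following sketch. The weights and predictions are made-up numbers for illustration, not data from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical body weights in kg and model predictions.
measured = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 57.0, 70.0, 84.0]

r2 = r_squared(measured, predicted)
error_pct = mape(measured, predicted)
```

R^2 measures how much of the variance in measured weight the predictions explain, while MAPE expresses the typical error relative to each animal's weight, so the two are complementary.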

URL

https://arxiv.org/abs/2504.17943

PDF

https://arxiv.org/pdf/2504.17943.pdf
Masked strategies for images with small objects

2025-04-24 20:52:23

H. Martin Gillis, Ming Hill, Paul Hollensen, Alan Fine, Thomas Trappenberg

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation Detection Deep_Learning Classification Transformer Pose Self-Supervised Reconstruction
Abstract

The detection and classification of small blood components in hematology analytics is a significant challenge, in particular when objects exist as small, pixel-sized entities in a large context of similar objects. Deep learning approaches using supervised models with pre-trained weights, such as residual networks and vision transformers, have demonstrated success for many applications. Unfortunately, when applied to images outside the domain of learned representations, these methods often result in less than acceptable performance. One strategy to overcome this is to use self-supervised models, where representations are learned and the weights are then applied to downstream applications. Recently, masked autoencoders (MAEs) have proven to be effective at obtaining representations that capture global context information. By masking regions of an image and having the model learn to reconstruct both the masked and non-masked regions, the weights can be used for various applications. However, if the sizes of the objects in images are less than the size of the mask, the global context information is lost, making it almost impossible to reconstruct the image. In this study, we investigated the effect of mask ratios and patch sizes for blood components using an MAE to obtain learned ViT encoder representations. We then applied the encoder weights to train a U-Net Transformer for semantic segmentation to obtain both local and global contextual information. Our experimental results demonstrate that both smaller mask ratios and smaller patch sizes improve the reconstruction of images using an MAE. We also show the results of semantic segmentation with and without pre-trained weights, where smaller-sized blood components benefited from pre-training. Overall, our proposed method offers an efficient and effective strategy for the segmentation and classification of small objects.
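
The interaction between patch size, mask ratio, and object size can be sketched with a small patch-masking routine. This is an illustrative sketch of random patch masking in general, not the authors' training code; the function names, image size, and object box are hypothetical.

```python
import random


def random_patch_mask(height, width, patch, ratio, seed=0):
    """Return a grid of booleans, True where a patch is masked."""
    rows, cols = height // patch, width // patch
    total = rows * cols
    rng = random.Random(seed)
    masked = set(rng.sample(range(total), int(total * ratio)))
    return [[(r * cols + c) in masked for c in range(cols)] for r in range(rows)]


def object_fully_hidden(mask, patch, box):
    """True if every patch touched by the box (top, left, bottom, right) is masked."""
    top, left, bottom, right = box
    touched = {
        (y // patch, x // patch)
        for y in range(top, bottom + 1)
        for x in range(left, right + 1)
    }
    return all(mask[r][c] for r, c in touched)


# Hypothetical 64 x 64 image with a 6 x 6 pixel object.
small_object = (20, 20, 25, 25)
coarse = random_patch_mask(64, 64, patch=16, ratio=0.75)
fine = random_patch_mask(64, 64, patch=4, ratio=0.75)
hidden_coarse = object_fully_hidden(coarse, 16, small_object)
hidden_fine = object_fully_hidden(fine, 4, small_object)
```

With 16-pixel patches the object sits inside a single patch, so it disappears entirely whenever that one patch is masked. With 4-pixel patches it spans several patches, and some of them are more likely to stay visible for the encoder to use.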

URL

https://arxiv.org/abs/2504.17935

PDF

https://arxiv.org/pdf/2504.17935.pdf
Beyond Labels: Zero-Shot Diabetic Foot Ulcer Wound Segmentation with Self-attention Diffusion Models and the Potential for Text-Guided Customization

2025-04-24 14:50:10

Abderrachid Hamrani, Daniela Leizaola, Renato Sousa, Jose P. Ponce, Stanley Mathis, David G. Armstrong, Anuradha Godavarty

arXiv_CV

arXiv_CV Segmentation Deep_Learning Attention Inference Unsupervised Pose Zero-Shot Medical Diffusion
Abstract

Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare, requiring precise and efficient wound assessment to enhance patient outcomes. This study introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel text-guided diffusion model that performs wound segmentation without relying on labeled training data. Unlike conventional deep learning models, which require extensive annotation, ADZUS leverages zero-shot learning to dynamically adapt segmentation based on descriptive prompts, offering enhanced flexibility and adaptability in clinical applications. Experimental evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art segmentation models, achieving an IoU of 86.68% and the highest precision of 94.69% on the chronic wound dataset, outperforming supervised approaches such as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its robustness, with ADZUS achieving a median DSC of 75%, significantly surpassing FUSegNet's 45%. The model's text-guided segmentation capability enables real-time customization of segmentation outputs, allowing targeted analysis of wound characteristics based on clinical descriptions. Despite its competitive performance, the computational cost of diffusion-based inference and the need for potential fine-tuning remain areas for future improvement. ADZUS represents a transformative step in wound segmentation, providing a scalable, efficient, and adaptable AI-driven solution for medical imaging.
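
The overlap metrics quoted above, intersection over union (IoU) and the Dice similarity coefficient (DSC), are computed from a predicted and a reference binary mask. The sketch below uses flattened lists of 0/1 values; the example masks are made up, and returning 1.0 for two empty masks is a convention chosen here, not something stated in the abstract.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treat this as perfect agreement (assumed convention).
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


# Hypothetical flattened masks.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
iou, dice = iou_and_dice(pred, target)
```

For a single pair of masks the two metrics are related by Dice = 2 * IoU / (1 + IoU), so Dice is never lower than IoU. This is worth keeping in mind when comparing an IoU reported on one dataset with a DSC reported on another.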

Abstract (translated)

糖尿病足溃疡（DFU）在医疗保健中是一个严峻的挑战，需要精确和高效的伤口评估以改善患者预后。本研究引入了一种新的文本引导扩散模型——注意力扩散零样本无监督系统（ADZUS），该系统能够在无需标注训练数据的情况下执行伤口分割。与传统的深度学习模型不同，后者需要大量的注释工作，ADZUS利用零样本学习根据描述性提示动态调整分割结果，从而在临床应用中提供了更高的灵活性和适应性。实验评估表明，ADZUS的表现超过了传统方法和当前最先进的分割模型，在慢性伤口数据集上实现了86.68%的交并比（IoU）和最高的精确度94.69%，优于监督式方法如FUSegNet。在针对DFU定制的数据集上的进一步验证表明其鲁棒性，ADZUS达到了75%的中位数Dice相似系数（DSC），远超FUSegNet的45%。该模型的文本引导分割能力使其能够根据临床描述实时自定义分割输出，允许对伤口特性进行针对性分析。尽管性能具有竞争力，但基于扩散推理的计算成本以及可能需要潜在微调的需求仍然是未来改进的方向。 ADZUS代表了在伤口分割领域中的一个变革性步骤，提供了一种可扩展、高效且适应性强的人工智能解决方案，为医学影像应用带来了革命性的进步。

URL

https://arxiv.org/abs/2504.17628

PDF

https://arxiv.org/pdf/2504.17628.pdf
Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images

2025-04-24 14:12:57

Zebo Huang, Yinghui Wang

arXiv_CV

arXiv_CV Segmentation Semantic_Segmentation CNN SLAM Pose Self-Supervised Reconstruction
Abstract

We propose a self-supervised monocular depth estimation network tailored for endoscopic scenes, aiming to infer depth within the gastrointestinal tract from monocular images. Existing methods, though accurate, typically assume consistent illumination, which is often violated due to dynamic lighting and occlusions caused by GI motility. These variations lead to incorrect geometric interpretations and unreliable self-supervised signals, degrading depth reconstruction quality. To address this, we introduce an occlusion-aware self-supervised framework. First, we incorporate an occlusion mask for data augmentation, generating pseudo-labels by simulating viewpoint-dependent occlusion scenarios. This enhances the model's ability to learn robust depth features under partial visibility. Second, we leverage semantic segmentation guided by non-negative matrix factorization, clustering convolutional activations to generate pseudo-labels in texture-deprived regions, thereby improving segmentation accuracy and mitigating information loss from lighting changes. Experimental results on the SCARED dataset show that our method achieves state-of-the-art performance in self-supervised depth estimation. Additionally, evaluations on the Endo-SLAM and SERV-CT datasets demonstrate strong generalization across diverse endoscopic environments.

Abstract (translated)

我们提出了一种针对内窥镜场景的自监督单目深度估计网络，旨在从单目图像中推断胃肠道内的深度信息。现有的方法虽然准确，但通常假设光照一致，而这种条件常常因动态照明和由胃肠运动引起的遮挡而被破坏。这些变化会导致错误的几何解释，并产生不可靠的自监督信号，从而降低深度重建质量。为解决这些问题，我们引入了一种感知遮挡的自监督框架。首先，我们在数据增强中加入了遮挡掩码，通过模拟视点依赖性遮挡场景生成伪标签，这增强了模型在部分可见情况下的鲁棒深度特征学习能力。其次，我们利用非负矩阵分解指导的语义分割，通过对卷积激活进行聚类来生成纹理匮乏区域中的伪标签，从而提高分割精度并减轻光照变化引起的信息损失。在SCARED数据集上的实验结果显示，我们的方法在自监督深度估计方面达到了最先进的性能。此外，在Endo-SLAM和SERV-CT数据集上的评估表明，该方法具有很强的跨多种内窥镜环境的泛化能力。

URL

https://arxiv.org/abs/2504.17582

PDF

https://arxiv.org/pdf/2504.17582.pdf
Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation

2025-04-24 12:57:25

Zihan Cheng, Jintao Guo, Jian Zhang, Lei Qi, Luping Zhou, Yinghuan Shi, Yang Gao

arXiv_CV

arXiv_CV Segmentation Knowledge Pose Medical
Abstract

To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation tasks. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping complexity linear in the input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by this success, we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model's generalizability under domain shift issues. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model's learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To the best of our knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, exceeding the previous SOTA of 88.61%. The code is available at this https URL.
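The local, sequence-wise augmentation described in this abstract can be illustrated with a rough sketch. This is only one plausible reading, not the Mamba-Sea implementation: the abstract does not specify how style statistics are modeled, so the sketch assumes that "style" means the per-channel mean and standard deviation of a contiguous run of tokens, and that new statistics are drawn from a Gaussian around the observed ones. The function name `perturb_subsequence_style` and the parameters `min_len` and `noise` are hypothetical.

```python
import math
import random


def perturb_subsequence_style(tokens, min_len=2, noise=0.1, rng=random):
    """Perturb the per-channel mean/std of one random contiguous run of tokens.

    tokens: list of equal-length lists of floats (a sequence of token features).
    Returns a new list; the input is left unmodified.
    """
    n, dim = len(tokens), len(tokens[0])
    length = rng.randint(min_len, n)      # length of the sub-sequence
    start = rng.randint(0, n - length)    # first token of the sub-sequence
    out = [list(tok) for tok in tokens]
    for c in range(dim):
        vals = [tokens[i][c] for i in range(start, start + length)]
        mean = sum(vals) / length
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / length) + 1e-6
        # Resample the style statistics around the observed ones
        # (assumed Gaussian model, not taken from the paper).
        new_mean = mean + rng.gauss(0.0, noise) * std
        new_std = std * max(1e-6, 1.0 + rng.gauss(0.0, noise))
        for i in range(start, start + length):
            # Normalize with the old statistics, then re-style with the new ones.
            out[i][c] = (tokens[i][c] - mean) / std * new_std + new_mean
    return out


random.seed(0)
tokens = [[float(i), 2.0 * i] for i in range(6)]
augmented = perturb_subsequence_style(tokens, noise=0.2)
```

Tokens outside the chosen run are copied unchanged, and setting `noise=0.0` reproduces the input up to floating-point error, which makes the perturbation strength easy to control in this toy version.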

Abstract (translated)

为了处理医学图像中的分布偏移，领域泛化（Domain Generalization, DG）作为一种有前景的方法被提出，用于训练模型在源域上进行学习，并能够推广到未见过的目标域。现有的DG方法主要基于CNN或ViT架构。最近，由Mamba为代表的先进状态空间模型，在各种监督下的医学图像分割任务中显示出令人鼓舞的结果。Mamba的成功主要归功于其捕获长距离依赖的能力，同时保持与输入序列长度呈线性复杂度的关系，使其成为CNN和ViT的有前景替代方案。受此启发，我们在这篇论文中探索了Mamba架构在解决DG领域分布偏移问题中的潜力。具体来说，我们提出了一种基于Mamba的新框架——Mamba-Sea，引入全局到局部序列增强机制来提高模型在域转移情况下的泛化能力。我们的Mamba-Sea引入了一个全局增强机制，旨在模拟不同站点间外观变化的潜在差异，以抑制模型学习特定领域的信息。在局部层面，我们提出了一种沿输入序列进行顺序增强的方法，通过建模和重采样与领域偏移相关的样式统计来扰动随机连续子序列中的令牌风格。据我们所知，Mamba-Sea是第一个探索Mamba泛化能力用于医学图像分割的工作，提供了一个具有强大域转移鲁棒性的先进且有前景的基于Mamba架构的方法。值得注意的是，我们的方法在前列腺数据集上首次实现了Dice系数超过90%，超过了先前最先进的88.61%的结果。代码可在此 https URL 获取。

URL

https://arxiv.org/abs/2504.17515

PDF

https://arxiv.org/pdf/2504.17515.pdf
CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality

2025-04-24 07:08:13

Junyan Zhang, Shuliang Liu, Aiwei Liu, Yubo Gao, Jungang Li, Xiaojie Gu, Xuming Hu

arXiv_CL

arXiv_CL Segmentation Detection Relation Language_Model Pose
Abstract

Watermarking technology is a method used to trace the usage of content generated by large language models. Sentence-level watermarking aids in preserving the semantic integrity within individual sentences while maintaining greater robustness. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.

Abstract (translated)

水印技术用于追踪大型语言模型生成内容的使用情况。句子级别的水印技术有助于在单个句子内保持语义完整性，同时提高鲁棒性。然而，许多现有的句子级别水印技术依赖于任意分段或生成过程来嵌入水印，这限制了适当句子的选择范围，进而影响生成响应的质量。为了解决如何在保证高文本质量的同时实现强大的水印检测这一挑战，我们提出了CoheMark，这是一种先进的句子级水印技术，利用句子之间的连贯关系来提升逻辑流畅性。CoheMark的核心方法包括通过训练好的模糊C均值聚类选择句子，并应用特定的下一句子选择标准。实验评估表明，CoheMark在施加最小文本质量影响的同时实现了强大的水印强度。

URL

https://arxiv.org/abs/2504.17309

PDF

https://arxiv.org/pdf/2504.17309.pdf
Advanced Segmentation of Diabetic Retinopathy Lesions Using DeepLabv3+

2025-04-24 07:00:38

Meher Boulaabi, Takwa Ben Aïcha Gader, Afef Kacem Echi, Sameh Mbarek

arXiv_CV

arXiv_CV Segmentation Optimization Medical
Abstract

To improve the segmentation of diabetic retinopathy lesions (microaneurysms, hemorrhages, exudates, and soft exudates), we implemented a binary segmentation method specific to each type of lesion. As a post-segmentation step, we combined the individual model outputs into a single image to better analyze the lesion types. This approach facilitated parameter optimization and improved accuracy, effectively overcoming challenges related to dataset limitations and annotation complexity. Specific preprocessing steps included cropping and applying contrast-limited adaptive histogram equalization to the L channel of the LAB image. Additionally, we employed targeted data augmentation techniques to further refine the model's efficacy. Our methodology utilized the DeepLabv3+ model, achieving a segmentation accuracy of 99%. These findings highlight the efficacy of innovative strategies in advancing medical image analysis, particularly in the precise segmentation of diabetic retinopathy lesions. The IDRID dataset was utilized to validate and demonstrate the robustness of our approach.

Abstract (translated)

为了改进糖尿病视网膜病变（包括微动脉瘤、出血、渗出物和软渗出物）的分割效果，我们为每种类型的病灶实施了一种二值分割方法。在后处理阶段，我们将各个模型输出结合成单一图像，以便更好地分析病灶类型。这种方法有助于参数优化并提高了准确性，有效解决了数据集限制和标注复杂性带来的挑战。具体预处理步骤包括裁剪以及对LAB图像的L通道应用对比度受限自适应直方图均衡化。此外，我们还采用了针对性的数据增强技术以进一步提高模型效果。我们的方法利用了DeepLabv3+模型，在分割准确率上达到了99%。这些发现凸显了创新策略在推进医学影像分析中的有效性，特别是在精确分割糖尿病视网膜病变方面。我们使用IDRID数据集来验证并展示这种方法的鲁棒性。

URL

https://arxiv.org/abs/2504.17306

PDF

https://arxiv.org/pdf/2504.17306.pdf
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

2025-04-24 02:41:34

Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung

arXiv_CV

arXiv_CV Segmentation Detection Object_Detection Language_Model Transformer Pose Autonomous Action Agent
Abstract

We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.

Abstract (translated)

我们提出了一种通过心理意象模拟实现视角感知推理的视觉-语言模型（VLM）框架。视角取用能力，即从替代视角感知环境或情境的能力，是人类水平视觉理解的关键指标，在与环境互动和自主代理协作中至关重要。尽管在空间推理方面取得了进展，但最近的研究表明，现代VLM在视角感知推理能力上存在显著不足，并且倾向于以自我中心解释为主导。为了弥合VLM与人类感知之间的差距，我们专注于心理意象的作用，即人类通过抽象表示来感知世界的方式，这种方式可以促进视角转换。受到这一点的启发，我们提出了一个称为抽象视角变换（APC）的框架，该框架有效地利用了视觉基础模型（如目标检测、分割和姿态估计），构建场景抽象并实现视角转变。我们在合成图像和真实图像基准上进行了实验，并将我们的方法与各种VLM进行比较，结果表明，我们的框架在视角感知推理方面取得了显著改善，超越了微调的空间推理模型以及基于新视图合成的方法。

URL

https://arxiv.org/abs/2504.17207

PDF

https://arxiv.org/pdf/2504.17207.pdf