Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to resolutions up to 4K/8K.
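The abstract does not detail how the shared large binary codebook works; a minimal sketch of sign-based binary quantization applied at several pyramid scales (in the spirit of lookup-free quantization, with hypothetical shapes and channel count) might look like:

```python
import numpy as np

def binary_quantize(z):
    # Sign-based binary quantization: each channel becomes one bit, so a
    # C-channel latent indexes an implicit codebook of size 2**C.
    codes = np.where(z >= 0, 1, -1)
    bits = (codes > 0).astype(np.int64)
    indices = (bits * (2 ** np.arange(z.shape[-1]))).sum(axis=-1)
    return codes, indices

rng = np.random.default_rng(0)
# hypothetical three-level feature pyramid sharing one quantizer
pyramid = [rng.standard_normal((s, s, 8)) for s in (4, 8, 16)]
tokens = [binary_quantize(level)[1] for level in pyramid]
```

With 8 channels the implicit vocabulary is 2^8 = 256 per token; real systems would use far more channels, and training would add a straight-through gradient, which this inference-only sketch omits.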
https://arxiv.org/abs/2601.16210
Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations, including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
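The augmentation stage of such a pipeline can be illustrated with a tiny degradation function; this is a generic sketch (Gaussian noise plus a 3x3 box blur), not SynthOCR-Gen's actual implementation, and the image here is a stand-in array rather than rendered text:

```python
import numpy as np

def degrade(img, rng, noise_std=0.05, blur=True):
    # Scanner-style degradations on a rendered text image in [0, 1]:
    # additive Gaussian noise followed by an optional 3x3 box blur.
    out = img + rng.normal(0.0, noise_std, img.shape)
    if blur:
        padded = np.pad(out, 1, mode="edge")
        out = sum(padded[i:i + out.shape[0], j:j + out.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = np.ones((32, 128))      # stand-in for a rendered word image
clean[12:20, 10:90] = 0.0       # a dark "stroke"
noisy = degrade(clean, rng)
```

A full generator would compose many such transforms (rotation, perspective, JPEG artifacts) with randomized parameters per sample.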
https://arxiv.org/abs/2601.16113
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
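The residual vector quantizer mentioned above follows a standard recipe: each stage quantizes the residual left by the previous stage, so a few small codebooks compose into a fine-grained code. A minimal encoding sketch with toy shapes (not SAMTok's actual codebooks):

```python
import numpy as np

def rvq_encode(x, codebooks):
    # Residual VQ: stage t snaps the current residual to its nearest
    # codeword, and the residual is recomputed against the running sum.
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:                                  # cb: (K, D)
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        indices.append(idx)
        recon += cb[idx]
        residual = x - recon
    return indices, recon

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 4)) for _ in range(3)]
x = rng.standard_normal((5, 4))
idx, recon = rvq_encode(x, codebooks)
```

With 3 stages of 16 codewords, each vector is described by 3 small indices, giving an effective vocabulary of 16^3 at the storage cost of 3 tokens.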
https://arxiv.org/abs/2601.16093
Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
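The mutual distillation at the core of this framework can be sketched as a symmetric soft-label objective; the temperature and logits below are illustrative, and the real system would run this over generated images rather than fixed vectors:

```python
import math

def softmax(logits, t=1.0):
    exps = [math.exp(l / t) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL divergence between two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mutual_distillation_loss(fm_logits, client_logits, t=2.0):
    # Symmetric soft-label matching on a shared (synthetic) sample: the
    # foundation model and a client each mimic the other's softened output.
    p, q = softmax(fm_logits, t), softmax(client_logits, t)
    return kl(p, q) + kl(q, p)

loss = mutual_distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5])
```

The loss vanishes only when both models agree, which is how general knowledge flows to clients while client-specific signal flows back to the foundation model.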
https://arxiv.org/abs/2601.16073
Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54+/-1.26% in IoU and 0.98+/-0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from the known domain, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
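The idea of phase-aware supervision rests on the fact that the FFT phase spectrum encodes spatial structure (object layout and boundaries) while magnitude encodes contrast. A generic phase-comparison loss, not the paper's exact formulation, can be sketched as:

```python
import numpy as np

def phase_alignment_loss(pred, target, eps=1e-8):
    # Compare FFT phase spectra of predicted and target masks; phase
    # carries the structural/boundary layout of an image.
    fp, ft = np.fft.fft2(pred), np.fft.fft2(target)
    phase_p = fp / (np.abs(fp) + eps)
    phase_t = ft / (np.abs(ft) + eps)
    return float(np.mean(np.abs(phase_p - phase_t) ** 2))

a = np.zeros((16, 16)); a[4:12, 4:12] = 1.0
shifted = np.roll(a, 2, axis=0)   # same content, displaced boundary
```

Because the two masks above have identical magnitude spectra, any loss difference comes purely from phase, i.e. from where the structure sits.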
https://arxiv.org/abs/2601.16064
Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they lack the capability to generate multiple proposals, support human interaction, and adapt across modalities. Recently, text-to-image diffusion models have shown potential to bridge the gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation purposes. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting the target organ. Our experiment on organ segmentation from CT images demonstrates strong performance compared to previous methods and could greatly benefit from an expert-in-the-loop setting to leverage multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
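The low-rank, few-shot transfer step follows the standard LoRA pattern: a frozen weight matrix is augmented with a trainable rank-r update so only a tiny parameter set is learned for the new modality. A minimal sketch with hypothetical dimensions (not the paper's actual layers):

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha=1.0):
    # Frozen weight W plus a trainable low-rank update B @ A; only A and B
    # are learned during few-shot adaptation to the new modality (e.g. MR).
    return x @ (w_frozen + alpha * (b @ a)).T

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 2
w = rng.standard_normal((d_out, d_in))
a = rng.standard_normal((r, d_in)) * 0.01
b = np.zeros((d_out, r))   # B = 0 at init, so adaptation starts as a no-op
x = rng.standard_normal((3, d_in))
```

The update adds only r * (d_in + d_out) parameters per layer, which is why a handful of MR examples can suffice.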
https://arxiv.org/abs/2601.16060
Neuron segmentation is the cornerstone of reconstructing comprehensive neuronal connectomes, which is essential for deciphering the functional organization of the brain. The irregular morphology and densely intertwined structures of neurons make this task particularly challenging. Prevailing CNN-based methods often fail to resolve ambiguous boundaries due to the lack of long-range context, whereas Transformer-based methods suffer from boundary imprecision caused by the loss of voxel-level details during patch partitioning. To address these limitations, we propose NeuroMamba, a multi-perspective framework that exploits the linear complexity of Mamba to enable patch-free global modeling and synergizes this with complementary local feature modeling, thereby efficiently capturing long-range dependencies while meticulously preserving fine-grained voxel details. Specifically, we design a channel-gated Boundary Discriminative Feature Extractor (BDFE) to enhance local morphological cues. Complementing this, we introduce the Spatial Continuous Feature Extractor (SCFE), which integrates a resolution-aware scanning mechanism into the Visual Mamba architecture to adaptively model global dependencies across varying data resolutions. Finally, a cross-modulation mechanism synergistically fuses these multi-perspective features. Our method demonstrates state-of-the-art performance across four public EM datasets, validating its exceptional adaptability to both anisotropic and isotropic resolutions. The source code will be made publicly available.
https://arxiv.org/abs/2601.15929
Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
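The latent-prediction objective that distinguishes JEPA-style training from pixel reconstruction can be sketched abstractly; the identity "predictor" and toy embeddings below are stand-ins for the real networks:

```python
import numpy as np

def jepa_loss(context_emb, predictor, target_emb, masked_idx):
    # Latent prediction: from context embeddings, predict the target
    # encoder's embeddings at masked positions -- no pixel reconstruction.
    pred = predictor(context_emb)                 # (num_patches, D)
    diff = pred[masked_idx] - target_emb[masked_idx]
    return float((diff ** 2).mean())

rng = np.random.default_rng(0)
target = rng.standard_normal((16, 8))             # target-encoder outputs
context = target + 0.1 * rng.standard_normal((16, 8))
identity = lambda z: z                            # stand-in predictor
loss = jepa_loss(context, identity, target, masked_idx=[3, 7, 11])
```

The loss is computed only at masked positions and only in latent space, which is the key contrast with both image-text alignment and DINO-style view matching drawn in the abstract.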
https://arxiv.org/abs/2601.15891
Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
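The maximum-mean-discrepancy measure used to quantify task alignment has a compact closed form; a straightforward RBF-kernel estimator over two feature sets (the bandwidth and sizes here are arbitrary choices, not the paper's settings):

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    # Biased squared MMD estimate with an RBF kernel between two feature
    # sets, e.g. the same layer's features before and after fine-tuning.
    def k(u, v):
        d = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
pre = rng.standard_normal((50, 8))                    # pre-fine-tuning features
post_near = pre + 0.01 * rng.standard_normal((50, 8)) # barely changed
post_far = pre + 2.0                                  # strongly shifted
```

A small MMD between pre- and post-fine-tuning features indicates the pretraining objective already produced representations close to what the downstream task needs.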
https://arxiv.org/abs/2601.15888
Neuron segmentation in electron microscopy (EM) aims to reconstruct the complete neuronal connectome; however, current deep learning-based methods are limited by their reliance on large-scale training data and extensive, time-consuming manual annotations. Traditional methods augment the training set through geometric and photometric transformations; however, the generated samples remain highly correlated with the original images and lack structural diversity. To address this limitation, we propose a diffusion-based data augmentation framework capable of generating diverse and structurally plausible image-label pairs for neuron segmentation. Specifically, the framework employs a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors to enable voxel-level image synthesis from 3D masks. It further incorporates a biology-guided mask remodeling module that produces augmented masks with enhanced structural realism. Together, these components effectively enrich the training set and improve segmentation performance. On the AC3 and AC4 datasets under low-annotation regimes, our method improves the ARAND metric by 32.1% and 30.7%, respectively, when combined with two different post-processing methods. Our code is available at this https URL.
https://arxiv.org/abs/2601.15779
This paper presents FeTal-SAM, a novel adaptation of the Segment Anything Model (SAM) tailored for fetal brain MRI segmentation. Traditional deep learning methods often require large annotated datasets for a fixed set of labels, making them inflexible when clinical or research needs change. By integrating atlas-based prompts and foundation-model principles, FeTal-SAM addresses two key limitations in fetal brain MRI segmentation: (1) the need to retrain models for varying label definitions, and (2) the lack of insight into whether segmentations are driven by genuine image contrast or by learned spatial priors. We leverage multi-atlas registration to generate spatially aligned label templates that serve as dense prompts, alongside a bounding-box prompt, for SAM's segmentation decoder. This strategy enables binary segmentation on a per-structure basis, which is subsequently fused to reconstruct the full 3D segmentation volumes. Evaluations on two datasets, the dHCP dataset and an in-house dataset, demonstrate FeTal-SAM's robust performance across gestational ages. Notably, it achieves Dice scores comparable to state-of-the-art baselines, which were trained for each dataset and label definition, on well-contrasted structures such as the cortical plate and cerebellum, while maintaining the flexibility to segment any user-specified anatomy. Although slightly lower accuracy is observed for subtle, low-contrast structures (e.g., hippocampus, amygdala), our results highlight FeTal-SAM's potential to serve as a general-purpose segmentation model without exhaustive retraining. This method thus constitutes a promising step toward clinically adaptable fetal brain MRI analysis tools.
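The abstract does not specify how per-structure binary masks are fused into one volume; one common strategy, sketched here with hypothetical per-structure confidence scores, is to let the highest-scoring structure claim each overlapping voxel:

```python
import numpy as np

def fuse_binary_masks(masks, scores):
    # Fuse per-structure binary masks into a single label volume: each
    # voxel takes the label of the highest-scoring structure claiming it
    # (0 = background). `scores` are hypothetical per-structure confidences.
    labels = np.zeros(masks[0].shape, dtype=np.int64)
    best = np.full(masks[0].shape, -np.inf)
    for lbl, (m, s) in enumerate(zip(masks, scores), start=1):
        win = (m > 0) & (s > best)
        labels[win] = lbl
        best[win] = s
    return labels

m1 = np.zeros((4, 4)); m1[:2] = 1      # structure 1 covers rows 0-1
m2 = np.zeros((4, 4)); m2[1:3] = 1     # structure 2 covers rows 1-2
fused = fuse_binary_masks([m1, m2], [0.9, 0.8])
```

Row 1 is claimed by both structures and resolves to label 1 because its score is higher; atlas agreement or predicted IoU could serve as the score in practice.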
https://arxiv.org/abs/2601.15759
The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.
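The sub-region-aware modality attention can be read as a softmax over modality logits per tumor sub-region, producing one mixing vector per region; the shapes and logits below are illustrative, not the paper's architecture:

```python
import numpy as np

def subregion_modality_attention(feats, logits):
    # Softmax over modality logits gives one weight vector per tumor
    # sub-region, used to mix the per-modality feature maps.
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)          # (regions, modalities)
    # feats: (modalities, H, W) -> fused: (regions, H, W)
    return np.einsum('rm,mhw->rhw', w, feats)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))            # e.g. T1, T1ce, T2, FLAIR
logits = np.array([[5.0, 0.0, 0.0, 0.0],          # region 1 trusts modality 0
                   [0.0, 0.0, 5.0, 0.0]])         # region 2 trusts modality 2
fused = subregion_modality_attention(feats, logits)
```

This lets, say, the necrotic-core region lean on the modality that best delineates it while other regions weight the modalities differently.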
https://arxiv.org/abs/2601.15734
This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes (boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels) without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an $\alpha$-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
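The $\alpha$-scale tempering of class reweighting can be sketched as inverse-frequency weights raised to an exponent between 0 (uniform) and 1 (full reweighting); the normalization and example frequencies here are illustrative choices, not the paper's exact formula:

```python
import numpy as np

def tempered_class_weights(freqs, alpha=0.5):
    # Inverse-frequency class weights tempered by an alpha exponent:
    # alpha = 1 is full reweighting, alpha = 0 is uniform weighting.
    w = (1.0 / np.asarray(freqs)) ** alpha
    return w / w.sum() * len(freqs)               # normalize to mean 1

freqs = np.array([0.70, 0.25, 0.05])              # long-tailed class frequencies
full = tempered_class_weights(freqs, alpha=1.0)
tempered = tempered_class_weights(freqs, alpha=0.5)
```

Tempering still boosts rare classes inside the focal+dice objective but caps how violently the head class is down-weighted, which is the usual motivation for such a factor.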
https://arxiv.org/abs/2601.15705
Accurate segmentation of cervical structures in transvaginal ultrasound (TVS) is critical for assessing the risk of spontaneous preterm birth (PTB), yet the scarcity of labeled data limits the performance of supervised learning approaches. This paper introduces the Fetal Ultrasound Grand Challenge (FUGC), the first benchmark for semi-supervised learning in cervical segmentation, hosted at ISBI 2025. FUGC provides a dataset of 890 TVS images, including 500 training images, 90 validation images, and 300 test images. Methods were evaluated using the Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), and runtime (RT), with a weighted combination of 0.4/0.4/0.2. The challenge attracted 10 teams with 82 participants submitting innovative solutions. The best-performing methods for each individual metric achieved 90.26\% mDSC, 38.88 mHD, and 32.85 ms RT, respectively. FUGC establishes a standardized benchmark for cervical segmentation, demonstrates the efficacy of semi-supervised methods with limited labeled data, and provides a foundation for AI-assisted clinical PTB risk assessment.
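Combining the three metrics at 0.4/0.4/0.2 requires mapping the lower-is-better HD and RT onto a common higher-is-better scale; the linear normalizers `hd_max` and `rt_max` below are hypothetical, since the challenge's exact normalization is not given in the abstract:

```python
def challenge_score(dsc, hd, rt, hd_max=100.0, rt_max=1000.0):
    # Weighted 0.4/0.4/0.2 combination of DSC, Hausdorff distance, and
    # runtime; HD and RT are flipped to higher-is-better via hypothetical
    # linear normalizers before weighting.
    hd_score = max(0.0, 1.0 - hd / hd_max)
    rt_score = max(0.0, 1.0 - rt / rt_max)
    return 0.4 * dsc + 0.4 * hd_score + 0.2 * rt_score

# plugging in the per-metric bests reported above (from different teams)
s = challenge_score(0.9026, 38.88, 32.85)
```

Whatever the true normalizers, the weighting makes geometric accuracy (DSC + HD, 0.8 total) dominate runtime (0.2).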
https://arxiv.org/abs/2601.15572
The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
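The two-stage retrieval pattern (fast bi-encoder recall, then cross-encoder re-ranking) can be sketched with stand-in scorers; the word-overlap functions below are placeholders for real embedding and cross-encoder models, and the documents are invented examples:

```python
def first_stage(query, docs, k=3):
    # Bi-encoder stand-in: score cheaply (here, word overlap) and keep
    # only the top-k candidates for the expensive second stage.
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def rerank(query, candidates, cross_score):
    # Cross-encoder stand-in: re-score each (query, doc) pair jointly.
    return sorted(candidates, key=lambda d: -cross_score(query, d))

docs = [
    "CDC guidance on policy analytical frameworks",
    "vaccination schedule for adults",
    "policy framework steps for public health analysis",
    "unrelated note on office supplies",
]
hits = first_stage("policy analytical framework", docs, k=2)
final = rerank("policy analytical framework", hits,
               cross_score=lambda q, d: len(set(q.split()) & set(d.split())))
```

The point of the split is cost: the cheap stage prunes the corpus so the quadratic-attention cross-encoder only sees k candidates, which is where the faithfulness gain of the Advanced RAG pipeline comes from.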
https://arxiv.org/abs/2601.15457
Cervical spine fractures are critical medical conditions requiring precise and efficient detection for effective clinical management. This study explores the viability of 2D projection-based vertebra segmentation for vertebra-level fracture detection in 3D CT volumes, presenting an end-to-end pipeline for automated analysis of cervical vertebrae (C1-C7). By approximating a 3D volume through optimized 2D axial, sagittal, and coronal projections, regions of interest are identified using the YOLOv8 model from all views and combined to approximate the 3D cervical spine area, achieving a 3D mIoU of 94.45 percent. This projection-based localization strategy reduces computational complexity compared to traditional 3D segmentation methods while maintaining high performance. It is followed by a DenseNet121-Unet-based multi-label segmentation leveraging variance- and energy-based projections, achieving a Dice score of 87.86 percent. Strategic approximation of 3D vertebral masks from these 2D segmentation masks enables the extraction of individual vertebra volumes. The volumes are analyzed for fractures using an ensemble of 2.5D Spatio-Sequential models incorporating both raw slices and projections per vertebra for complementary evaluation. This ensemble achieves vertebra-level and patient-level F1 scores of 68.15 and 82.26, and ROC-AUC scores of 91.62 and 83.04, respectively. We further validate our approach through an explainability study that provides saliency map visualizations highlighting anatomical regions relevant for diagnosis, and an interobserver variability analysis comparing our model's performance with expert radiologists, demonstrating competitive results.
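Collapsing a CT volume to the three orthogonal views is a one-liner per view; the sketch below uses max projection on a toy volume (variance- or energy-based projections, as used for the segmentation stage, follow the same axis-reduction pattern), and the axis-to-view naming assumes a (slice, row, column) layout:

```python
import numpy as np

def orthogonal_projections(vol):
    # Collapse a 3D CT volume to axial/coronal/sagittal 2D views via max
    # projection; axis naming assumes a (slice, row, column) volume layout.
    return {
        "axial":    vol.max(axis=0),
        "coronal":  vol.max(axis=1),
        "sagittal": vol.max(axis=2),
    }

vol = np.zeros((10, 12, 14))
vol[3:6, 4:8, 5:9] = 1.0        # stand-in blob for a vertebra
views = orthogonal_projections(vol)
```

Running 2D detection on each view and intersecting the resulting boxes approximates the 3D region of interest at a fraction of the cost of full 3D segmentation.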
https://arxiv.org/abs/2601.15235
Autonomous drone racing represents a major frontier in robotics research. It requires an Artificial Intelligence (AI) that can run on board lightweight flying robots under tight resource and time constraints, while pushing the physical system to its limits. The state of the art in this area consists of a system with a stereo camera and an inertial measurement unit (IMU) that beat human drone racing champions in a controlled indoor environment. Here, we present MonoRace: an onboard drone racing approach that uses a monocular rolling-shutter camera and an IMU, and that generalizes to a competition environment without any external motion-tracking system. The approach features robust state estimation that combines neural-network-based gate segmentation with a drone model. Moreover, it includes an offline optimization procedure that leverages the known geometry of gates to refine any state estimation parameter. This offline optimization is based purely on onboard flight data and is important for fine-tuning the vital external camera calibration parameters. Furthermore, the guidance and control are performed by a neural network that foregoes inner-loop controllers by directly sending motor commands. This small network runs on the flight controller at 500Hz. The proposed approach won the 2025 Abu Dhabi Autonomous Drone Racing Competition (A2RL), outperforming all competing AI teams and three human world champion pilots in a direct knockout tournament. It set a new milestone in autonomous drone racing research, reaching speeds up to 100 km/h on the competition track and successfully coping with problems such as camera interference and IMU saturation.
https://arxiv.org/abs/2601.15222
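The offline refinement described above fits camera parameters by exploiting the known gate geometry in logged flight data. A toy sketch of the idea, assuming a simple pinhole model and a grid search over the focal length (the gate dimensions, image center, and search range are invented for illustration; the real procedure optimizes more parameters from real logs):

```python
import math

# Known gate geometry: a square gate, corners in the camera frame (m).
# Hypothetical numbers; the real pipeline uses logged flight data.
GATE_CORNERS = [(-0.7, -0.7, 5.0), (0.7, -0.7, 5.0),
                (0.7, 0.7, 5.0), (-0.7, 0.7, 5.0)]

def project(point, f, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

def reprojection_error(f, observations):
    """Mean pixel distance between projected corners and detections."""
    err = 0.0
    for corner, (u, v) in zip(GATE_CORNERS, observations):
        pu, pv = project(corner, f)
        err += math.hypot(pu - u, pv - v)
    return err / len(observations)

def refine_focal(observations, f_lo=300.0, f_hi=700.0, steps=4001):
    """Grid search for the focal length minimizing reprojection error."""
    best_f, best_e = f_lo, float("inf")
    for i in range(steps):
        f = f_lo + (f_hi - f_lo) * i / (steps - 1)
        e = reprojection_error(f, observations)
        if e < best_e:
            best_f, best_e = f, e
    return best_f
```

Given corner detections simulated with `f = 500`, `refine_focal` recovers that focal length; on real data the same residual would be minimized jointly over all calibration parameters with a proper optimizer.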
Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero-overhead Evolution using Norm-Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in less wall-clock time than baselines. It also yields superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R-CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.
https://arxiv.org/abs/2601.15212
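The abstract only states that the LR is adapted from the temporal evolution of the gradient norm; the exact ZENITH update rule is not given there. The toy rule below is therefore an illustrative stand-in, not the published algorithm: it compares a recent window of gradient norms against an older one and scales the LR accordingly.

```python
from collections import deque

def zenith_like_lr(base_lr, norm_history, window=10):
    """Toy norm-informed LR rule (NOT the published ZENITH update):
    shrink the step when the recent average gradient norm grows
    relative to the older average, grow it when norms are falling."""
    if len(norm_history) < 2 * window:
        return base_lr
    recent = sum(list(norm_history)[-window:]) / window
    older = sum(list(norm_history)[-2 * window:-window]) / window
    ratio = older / (recent + 1e-12)
    return base_lr * min(2.0, max(0.5, ratio))

def train_quadratic(steps=200, base_lr=0.1):
    """Minimize f(w) = 0.5 * w^2 with norm-informed gradient descent."""
    w = 5.0
    history = deque(maxlen=40)  # zero-overhead: only scalars are stored
    for _ in range(steps):
        grad = w                # f'(w) = w
        history.append(abs(grad))
        lr = zenith_like_lr(base_lr, history)
        w -= lr * grad
    return w
```

On this quadratic the gradient norms shrink monotonically, so the rule ramps the LR up (capped at 2x) and the iterate converges quickly; storing only a short scalar history is what makes such schemes essentially free in memory and compute.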
Culverts and sewer pipes are critical components of drainage systems, and their failure can lead to serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge; assembling a large dataset for structural defect detection is therefore not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real-world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity. This can be addressed either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms. FORTRESS achieves state-of-the-art performance on the culvert and sewer pipe defect dataset, while significantly reducing the number of trainable parameters as well as its computational cost. Finally, we investigate few-shot semantic segmentation and its applicability to defect detection. Few-shot learning aims to train models with only limited data available. By employing a bidirectional prototypical network with attention mechanisms, the model learns richer feature representations and achieves satisfactory results across evaluation metrics.
https://arxiv.org/abs/2601.15366
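The parameter savings FORTRESS attributes to depthwise separable convolutions can be verified with simple counting: a standard k x k convolution couples every input channel to every output channel, while the separable variant splits this into a per-channel spatial filter plus a 1 x 1 pointwise mix. A quick arithmetic sketch (the layer sizes below are arbitrary examples, not FORTRESS's actual configuration):

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

# Example layer: 256 -> 256 channels with a 3 x 3 kernel.
standard = conv_params(256, 256, 3)                  # 589,824 params
separable = depthwise_separable_params(256, 256, 3)  # 67,840 params
reduction = standard / separable                     # roughly 8.7x fewer
```

The general ratio is `1/c_out + 1/k**2`, so for wide layers a 3 x 3 separable block approaches a ninefold reduction in both parameters and multiply-accumulates, which is where most of the architecture's efficiency comes from.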
Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, \textbf{Gra}ph Recognition via \textbf{S}ubgraph \textbf{P}rediction (\textbf{GraSP}), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a set of diverse types of graphs and their drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.
https://arxiv.org/abs/2601.15133
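Recognizing a graph via subgraph prediction implies an assembly step: the local subgraphs predicted per image region must be stitched into one global graph. A minimal sketch of that union step, assuming nodes that appear in several subgraphs share an identifier (the id-based matching and the toy patches are illustrative assumptions; GraSP's actual matching is learned):

```python
def merge_subgraphs(subgraphs):
    """Union overlapping subgraph predictions into one global graph.

    Each subgraph is (nodes, edges): nodes maps a node id to an
    attribute (e.g. a rough image position), edges is a set of
    (id_a, id_b) pairs. Shared ids stitch the pieces together.
    """
    nodes, edges = {}, set()
    for sub_nodes, sub_edges in subgraphs:
        nodes.update(sub_nodes)
        for a, b in sub_edges:
            edges.add((min(a, b), max(a, b)))  # undirected, deduplicated
    return nodes, edges

# Two overlapping predictions sharing node "B".
patches = [
    ({"A": (0, 0), "B": (1, 0)}, {("A", "B")}),
    ({"B": (1, 0), "C": (2, 1)}, {("B", "C"), ("C", "B")}),
]
graph_nodes, graph_edges = merge_subgraphs(patches)
```

Here the two patches merge into a three-node path A-B-C, with the duplicated edge `("C", "B")` normalized away; a real pipeline would additionally reconcile conflicting node attributes across patches.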