This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
https://arxiv.org/abs/2512.13639
This work introduces {\it PrahokBART}, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.
https://arxiv.org/abs/2512.13552
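The abstract does not specify how PrahokBART's word segmentation module works; as a hedged illustration of the kind of preprocessing it describes for an unspaced script like Khmer, here is a minimal greedy longest-match segmenter. The vocabulary and the romanized example strings are invented for this sketch only:

```python
def max_match_segment(text, vocab, max_len=None):
    """Greedy longest-match segmentation for scripts written without spaces.

    Scans left to right; at each position it takes the longest dictionary
    word that matches, and falls back to single characters for OOV spans.
    """
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # OOV fallback: emit one character
            i += 1
    return tokens

# Toy romanized "dictionary"; a real system would use a Khmer lexicon.
vocab = {"srok", "khmer", "khnhom", "sralanh"}
segmented = max_match_segment("khnhomsralanhsrokkhmer", vocab)
```

Greedy maximal matching is only one classical baseline; learned segmenters generally handle ambiguity better, which is presumably why segmentation quality matters for the pre-training corpus.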
A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations that are semantically coherent across images.
https://arxiv.org/abs/2512.13534
Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.
https://arxiv.org/abs/2512.13454
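The core mechanism above (noising a target image, then denoising it under the source distribution) can be illustrated with a 1-D Gaussian toy model. This is a sketch of the diffuse-then-denoise idea only, not the authors' pipeline; the distributions, noise level, and sample count are all invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: "source domain" samples follow N(0, 1); incoming
# "target domain" samples arrive shifted, following N(3, 1).
target = rng.normal(3.0, 1.0, size=100_000)

# Forward diffusion to a fixed noise level, DDPM-style:
#   x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps
ab = 0.25
x_t = np.sqrt(ab) * target + np.sqrt(1.0 - ab) * rng.normal(size=target.shape)

# Denoise under the *source* prior N(0, 1): with unit-variance prior and
# noise, the posterior mean of x0 given x_t is sqrt(ab) * x_t, which
# pulls the shifted samples back toward the source mean of 0.
mapped = np.sqrt(ab) * x_t
```

With this noise level the mapped population mean sits at `ab * 3 = 0.75`, much closer to the source mean than the original shift of 3; a real diffusion model plays the role of the closed-form posterior mean here.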
As the therapeutic target for Inflammatory Bowel Disease (IBD) shifts toward histologic remission, the accurate assessment of microscopic inflammation has become increasingly central for evaluating disease activity and response to treatment. In this work, we introduce IMILIA (Interpretable Multiple Instance Learning for Inflammation Analysis), an end-to-end framework designed for the prediction of inflammation presence in IBD digitized slides stained with hematoxylin and eosin (H&E), followed by the automated computation of markers characterizing tissue regions driving the predictions. IMILIA is composed of an inflammation prediction module, consisting of a Multiple Instance Learning (MIL) model, and an interpretability module, divided into two blocks: HistoPLUS, for cell instance detection, segmentation and classification; and EpiSeg, for epithelium segmentation. IMILIA achieves a cross-validation ROC-AUC of 0.83 on the discovery cohort, and a ROC-AUC of 0.99 and 0.84 on two external validation cohorts. The interpretability module yields biologically consistent insights: tiles with higher predicted scores show increased densities of immune cells (lymphocytes, plasmocytes, neutrophils and eosinophils), whereas lower-scored tiles predominantly contain normal epithelial cells. Notably, these patterns were consistent across all datasets. Code and models to partially replicate the results on the public IBDColEpi dataset can be found at this https URL.
https://arxiv.org/abs/2512.13440
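The abstract does not describe the MIL model's internals; a standard choice for this kind of slide-level prediction with tile-level interpretability is attention-based MIL pooling (in the style of Ilse et al.), sketched below with invented dimensions. In an IMILIA-like setup, the per-tile attention weights are what lets high-scoring tiles be inspected afterwards:

```python
import numpy as np

def attention_mil_pool(tiles, V, w):
    """Attention-based MIL pooling: each tile embedding h_k receives a
    weight a_k proportional to exp(w . tanh(V h_k)); the slide-level
    embedding is the attention-weighted sum of tile embeddings."""
    scores = np.tanh(tiles @ V.T) @ w        # one scalar score per tile, (K,)
    a = np.exp(scores - scores.max())        # numerically stable softmax
    a /= a.sum()                             # attention weights sum to 1
    return a, a @ tiles                      # weights (K,), pooled (D,)

rng = np.random.default_rng(1)
tiles = rng.normal(size=(8, 16))             # 8 tiles with 16-dim features
V, w = rng.normal(size=(32, 16)), rng.normal(size=32)
weights, slide_emb = attention_mil_pool(tiles, V, w)
```

The pooled embedding would feed a slide-level classifier, while ranking tiles by `weights` gives the regions "driving the predictions".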
This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.
https://arxiv.org/abs/2512.13411
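The exact compositing math of the two-pass technique is not given in the abstract; a minimal sketch, assuming a multiplicative darkening term from the proxy-mesh shadow map plus a subtle highlight where the map indicates direct light (the strength constants are invented):

```python
import numpy as np

def composite_shadow_pass(img, shadow_map, shadow_strength=0.5, highlight=0.15):
    """Composite a proxy-mesh shadow map onto a rendered splat image:
    darken shadowed pixels multiplicatively, slightly brighten fully lit
    ones, and clamp to the valid [0, 1] range.

    img:        float array in [0, 1], shape (H, W) or (H, W, 3)
    shadow_map: float array in [0, 1], where 1 = fully shadowed
    """
    shade = 1.0 - shadow_strength * shadow_map   # 1.0 in light, 0.5 in shadow
    lift = 1.0 + highlight * (1.0 - shadow_map)  # subtle highlight in light
    if img.ndim == 3:
        shade, lift = shade[..., None], lift[..., None]
    return np.clip(img * shade * lift, 0.0, 1.0)

flat = np.full((4, 4), 0.5)                      # uniform gray test image
shadow = np.zeros((4, 4)); shadow[2:, :] = 1.0   # bottom half in shadow
out = composite_shadow_pass(flat, shadow)
```

Multiplicative shading preserves texture contrast inside shadows, which is one plausible reason to composite rather than bake shadows into the splats.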
Purpose: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Current systems based on intraoperative radiographic imaging and bone-anchored markers are invasive, radiation-intensive, and workflow-disruptive. Recent markerless RGB-D registration methods offer a promising alternative, but existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, which can propagate errors throughout registration. Methods: We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for weak segmentation labels and manual steps. The network learns segmentation masks specifically optimized for registration, guided solely by the registration objective without direct segmentation supervision. Results: The proposed framework achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83mm and mean Root Mean Square Error by 45% to 3.95mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy. Conclusion: The presented end-to-end RGB-D registration pipeline removes dependency on weak labels and manual steps, advancing towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: this https URL.
https://arxiv.org/abs/2512.13402
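The reported metric, Target Registration Error, is simply the per-landmark distance after applying the estimated rigid transform. The sketch below pairs it with a generic least-squares rigid solver (Kabsch) on synthetic points; this is a reference for the metric, not the End2Reg network:

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) such that Q ~= P @ R.T + t."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, Q.mean(0) - R @ P.mean(0)

def target_registration_error(P, Q, R, t):
    """Per-landmark distance between transformed P and its targets Q."""
    return np.linalg.norm(P @ R.T + t - Q, axis=1)

rng = np.random.default_rng(3)
P = rng.normal(size=(20, 3))                      # synthetic landmarks
theta = np.pi / 6                                 # known ground-truth motion
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([5.0, -2.0, 1.0])
R, t = kabsch(P, Q)
tre = target_registration_error(P, Q, R, t)
```

On noise-free synthetic data the recovered transform matches the ground truth and the median TRE is numerically zero; on real intraoperative data the paper's 1.83mm median is what this metric reports.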
Accurately predicting topologically correct masks remains a difficult task for general segmentation models, which often produce fragmented or disconnected outputs. Fixing these artifacts typically requires hand-crafted refinement rules or architectures specialized to a particular task. Here, we show that Neural Cellular Automata (NCA) can be directly re-purposed as an effective refinement mechanism, using local, iterative updates guided by image context to repair segmentation masks. By training on imperfect masks and ground truths, the automaton learns the structural properties of the target shape while relying solely on local information. When applied to coarse, globally predicted masks, the learned dynamics progressively reconnect broken regions, prune loose fragments and converge towards stable, topologically consistent results. We show how refinement NCA (rNCA) can be easily applied to repair common topological errors produced by different base segmentation models and tasks: for fragmented retinal vessels, it yields 2-3% gains in Dice/clDice and improves Betti errors, reducing $\beta_0$ errors by 60% and $\beta_1$ by 20%; for myocardium, it repairs 61.5% of broken cases in a zero-shot setting while lowering ASSD and HD by 19% and 16%, respectively. This showcases NCA as an effective and broadly applicable refiner.
https://arxiv.org/abs/2512.13397
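The rNCA update rule is learned from data; as a hand-crafted stand-in that illustrates the same behavior described above (local, iterative updates that reconnect broken regions and prune loose fragments), here is a toy cellular automaton on a binary mask. The bridge/prune rules are invented for the sketch, not the trained NCA:

```python
import numpy as np

def repair_step(m):
    """One local update on a boolean mask: fill a background cell that
    bridges two opposite foreground 4-neighbours, and prune foreground
    cells that have no 4-neighbours at all."""
    left  = np.pad(m, ((0, 0), (1, 0)))[:, :-1]
    right = np.pad(m, ((0, 0), (0, 1)))[:, 1:]
    up    = np.pad(m, ((1, 0), (0, 0)))[:-1, :]
    down  = np.pad(m, ((0, 1), (0, 0)))[1:, :]
    n4 = left.astype(int) + right + up + down
    fill = ~m & ((left & right) | (up & down))    # reconnect 1-px gaps
    keep = m & (n4 > 0)                           # drop isolated fragments
    return keep | fill

mask = np.zeros((5, 9), dtype=bool)
mask[2, 1:4] = mask[2, 5:8] = True                # a "vessel" with a 1-px gap
mask[4, 0] = True                                 # a loose fragment
repaired = repair_step(mask)
```

After one step the gap at column 4 is bridged and the stray pixel is gone; a trained NCA replaces these fixed rules with learned, image-conditioned ones and iterates until the mask stabilizes.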
Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention "key" features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
https://arxiv.org/abs/2512.13376
Semantic segmentation requires a holistic understanding of the physical world, as it assigns semantic labels to spatially continuous and structurally coherent objects rather than to isolated pixels. However, existing data-free knowledge distillation (DFKD) methods, primarily designed for classification, often disregard this continuity, resulting in significant performance degradation when applied directly to segmentation tasks. In this paper, we introduce DFSS, a novel data-free distillation framework tailored for semantic segmentation. Unlike prior approaches that treat pixels independently, DFSS respects the structural and contextual continuity of real-world scenes. Our key insight is to leverage Batch Normalization (BN) statistics from a teacher model to guide Approximate Distribution Sampling (ADS), enabling the selection of data that better reflects the original training distribution, without relying on potentially misleading teacher predictions. Additionally, we propose Weighted Distribution Progressive Distillation (WDPD), which dynamically prioritizes reliable samples that are more closely aligned with the original data distribution early in training and gradually incorporates more challenging cases, mirroring the natural progression of learning in human perception. Extensive experiments on standard benchmarks demonstrate that DFSS consistently outperforms existing data-free distillation methods for semantic segmentation, achieving state-of-the-art results with significantly reduced reliance on auxiliary data.
https://arxiv.org/abs/2512.13175
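The ADS idea, selecting data whose feature statistics match the teacher's BN running statistics, can be sketched with a simple distance between channel-wise statistics. The distance function and the synthetic candidate batches below are invented for illustration; the paper's exact criterion may differ:

```python
import numpy as np

def bn_stat_distance(feats, bn_mean, bn_var):
    """Squared distance between a candidate batch's channel-wise feature
    statistics and a teacher's BN running statistics; lower means the
    batch looks more like the teacher's original training distribution."""
    mu, var = feats.mean(axis=0), feats.var(axis=0)
    return float(((mu - bn_mean) ** 2).sum() + ((var - bn_var) ** 2).sum())

rng = np.random.default_rng(2)
bn_mean, bn_var = np.zeros(16), np.ones(16)        # teacher's running stats
in_dist  = rng.normal(0.0, 1.0, size=(512, 16))    # matches those stats
off_dist = rng.normal(2.0, 1.0, size=(512, 16))    # a shifted candidate
ranked = sorted(
    [("in_dist", bn_stat_distance(in_dist, bn_mean, bn_var)),
     ("off_dist", bn_stat_distance(off_dist, bn_mean, bn_var))],
    key=lambda kv: kv[1])
```

Ranking candidates this way uses only frozen teacher statistics, which is the point the abstract makes: selection does not depend on potentially misleading teacher predictions.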
The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
https://arxiv.org/abs/2512.13164
Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.
https://arxiv.org/abs/2512.13101
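"Pseudo-label learning adaptively regulated by predictive uncertainty" can be sketched as a cross-entropy loss whose per-sample weight decays with the normalized entropy of the teacher's predictive distribution. This weighting scheme is a generic choice, not necessarily UnCoL's exact formulation:

```python
import numpy as np

def uncertainty_weighted_ce(probs, pseudo_labels):
    """Cross-entropy on pseudo-labels where each sample is down-weighted
    by its normalized predictive entropy: confident teacher predictions
    get weight near 1, ambiguous ones near 0."""
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    w = 1.0 - entropy / np.log(probs.shape[-1])   # normalized to [0, 1]
    ce = -np.log(probs[np.arange(len(pseudo_labels)), pseudo_labels] + eps)
    return (w * ce).mean(), w

probs = np.array([[0.98, 0.01, 0.01],   # a confident pixel
                  [0.34, 0.33, 0.33]])  # an ambiguous pixel
loss, w = uncertainty_weighted_ce(probs, probs.argmax(-1))
```

The near-uniform prediction contributes almost nothing to the loss, which is the "selectively suppresses unreliable supervision" behavior the abstract describes.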
Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods must either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or a scalability problem, respectively. To address both problems at once, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing these, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.
https://arxiv.org/abs/2512.13014
Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations. Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and providing an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
https://arxiv.org/abs/2512.13008
Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process, from raw document preprocessing and article-level alignment to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.
https://arxiv.org/abs/2512.12950
Building extraction from remote sensing images is a challenging task due to the complex structure variations of the buildings. Existing methods employ convolutional or self-attention blocks to capture the multi-scale features in the segmentation models, while the inherent gap between feature-pyramid levels and insufficient global-local feature integration lead to inaccurate, ambiguous extraction results. To address this issue, in this paper, we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable of exploiting high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder, which adopts hybrid CNN and transformer layers at different stages to capture the local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features when the network becomes deeper. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate the segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) to explicitly estimate the pixel-wise uncertainty to enhance the segmentation accuracy. Extensive experiments demonstrate that our method achieves superior performance to other state-of-the-art methods. Our code is available at this https URL
https://arxiv.org/abs/2512.12941
Accurate thyroid nodule segmentation in ultrasound images is critical for diagnosis and treatment planning. However, ambiguous boundaries between nodules and surrounding tissues, size variations, and the scarcity of annotated ultrasound data pose significant challenges for automated segmentation. Existing deep learning models struggle to incorporate contextual information from the thyroid gland and generalize effectively across diverse cases. To address these challenges, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data to enhance the feature extraction capability of the Transformer-centric encoder in an initial unsupervised phase. In the supervised phase, the model jointly optimizes nodule segmentation, gland segmentation, and nodule size estimation, integrating both local and global contextual features. Extensive evaluations on the TN3K and DDTI datasets demonstrate that SSMT-Net outperforms state-of-the-art methods, with higher accuracy and robustness, indicating its potential for real-world clinical applications.
https://arxiv.org/abs/2512.12662
Accurate coronary artery segmentation from coronary computed tomography angiography is essential for quantitative coronary analysis and clinical decision support. Nevertheless, reliable segmentation remains challenging because of small vessel calibers, complex branching, blurred boundaries, and myocardial interference. We propose a coronary artery segmentation framework that integrates myocardial anatomical priors, structure-aware feature encoding, and three-dimensional wavelet/inverse-wavelet transformations. Myocardial priors and residual-attention-based feature enhancement are incorporated during encoding to strengthen coronary structure representation. Wavelet/inverse-wavelet-based downsampling and upsampling enable joint spatial-frequency modeling and preserve multi-scale structural consistency, while a multi-scale feature fusion module integrates semantic and geometric information in the decoding stage. The model is trained and evaluated on the public ImageCAS dataset using a 3D overlapping-patch-based strategy with a 7:1:2 split for training, validation, and testing. Experimental results demonstrate that the proposed method achieves a Dice coefficient of 0.8082, a Sensitivity of 0.7946, a Precision of 0.8471, and an HD95 of 9.77 mm, outperforming several mainstream segmentation models. Ablation studies further confirm the complementary contributions of the individual components. The proposed method enables more stable and consistent coronary artery segmentation under complex geometric conditions, providing reliable segmentation results for subsequent coronary structure analysis tasks.
https://arxiv.org/abs/2512.12539
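The wavelet/inverse-wavelet down- and upsampling described above relies on the transform being lossless: halving resolution into subbands discards nothing, unlike strided pooling. The following is a minimal NumPy sketch of a one-level 3D Haar DWT/IDWT pair illustrating this perfect-reconstruction property; all function names are illustrative, not the paper's implementation.

```python
import numpy as np

def haar_dwt_1d(x, axis):
    """Single-level Haar split along one axis into half-size (low, high) bands."""
    a = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)  # even samples
    b = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)  # odd samples
    return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

def haar_dwt_3d(vol):
    """One-level 3D Haar DWT: a (D, H, W) volume -> 8 half-resolution subbands."""
    bands = {'': vol}
    for axis in range(3):
        bands = {key + tag: band
                 for key, v in bands.items()
                 for tag, band in zip('LH', haar_dwt_1d(v, axis))}
    return bands  # keys 'LLL', 'LLH', ..., 'HHH' (one letter per axis)

def haar_idwt_1d(lo, hi, axis):
    """Inverse of haar_dwt_1d: interleave the reconstructed even/odd samples."""
    a, b = (lo + hi) / np.sqrt(2), (lo - hi) / np.sqrt(2)
    shape = list(lo.shape)
    shape[axis] *= 2
    out = np.empty(shape)
    even = [slice(None)] * 3; even[axis] = slice(0, None, 2)
    odd = [slice(None)] * 3; odd[axis] = slice(1, None, 2)
    out[tuple(even)], out[tuple(odd)] = a, b
    return out

def haar_idwt_3d(bands):
    """Merge the 8 subbands back into the full-resolution volume."""
    for axis in reversed(range(3)):
        bands = {key: haar_idwt_1d(bands[key + 'L'], bands[key + 'H'], axis)
                 for key in {k[:-1] for k in bands}}
    return bands['']

vol = np.random.rand(8, 8, 8)
rec = haar_idwt_3d(haar_dwt_3d(vol))
print(np.allclose(vol, rec))  # lossless round trip -> True
```

Because the eight subbands jointly retain all of the volume's information, swapping strided pooling for such a transform lets an encoder halve resolution without discarding the high-frequency detail that thin vessels live in.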
Unsupervised domain adaptation (UDA) enables semantic segmentation models to generalize from a labeled source domain to an unlabeled target domain. However, existing UDA methods still struggle to bridge the domain gap due to cross-domain contextual ambiguity, inconsistent feature representations, and class-wise pseudo-label noise. To address these challenges, we propose Omni-level Masking for Unsupervised Domain Adaptation (OMUDA), a unified framework that introduces hierarchical masking strategies across distinct representation levels. Specifically, OMUDA comprises: 1) a Context-Aware Masking (CAM) strategy that adaptively distinguishes foreground from background to balance global context and local details; 2) a Feature Distillation Masking (FDM) strategy that enhances robust and consistent feature learning through knowledge transfer from pre-trained models; and 3) a Class Decoupling Masking (CDM) strategy that mitigates the impact of noisy pseudo-labels by explicitly modeling class-wise uncertainty. This hierarchical masking paradigm effectively reduces the domain shift at the contextual, representational, and categorical levels, providing a unified solution beyond existing approaches. Extensive experiments on multiple challenging cross-domain semantic segmentation benchmarks validate the effectiveness of OMUDA. Notably, on the SYNTHIA->Cityscapes and GTA5->Cityscapes tasks, OMUDA can be seamlessly integrated into existing UDA methods and consistently achieves state-of-the-art results with an average improvement of 7%.
https://arxiv.org/abs/2512.12303
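As a rough illustration of the class-decoupling idea behind CDM, the sketch below filters pseudo-labels with a per-class confidence threshold rather than a single global one, so confident classes cannot set an unreachable bar for hard ones. This is plain NumPy; the function name and the quantile rule are our assumptions, not the paper's actual formulation.

```python
import numpy as np

def class_decoupled_mask(probs, quantile=0.5):
    """Illustrative class-wise pseudo-label filtering.

    probs: (C, H, W) softmax output of the target-domain model.
    Returns hard pseudo-labels (H, W) and a boolean mask that keeps a pixel
    only if its confidence clears the quantile threshold of its own class.
    """
    labels = probs.argmax(axis=0)   # hard pseudo-labels
    conf = probs.max(axis=0)        # per-pixel confidence
    mask = np.zeros_like(conf, dtype=bool)
    for c in np.unique(labels):
        sel = labels == c
        thr = np.quantile(conf[sel], quantile)  # class-specific threshold
        mask |= sel & (conf >= thr)
    return labels, mask

# toy example: random softmax over 3 classes on an 8x8 image
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 8, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
labels, mask = class_decoupled_mask(probs)
```

With a global threshold, low-confidence classes (typically the rare ones under domain shift) would be filtered out almost entirely; thresholding per class keeps a balanced supervision signal across categories.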
Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use a generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, exploiting its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address this issue, we propose a novel approach that applies segmentation objectives directly to the LLM's output tokens. The LLM is fed a fixed number of frames alongside a prompt that constrains it to output a sequence of "0" and/or "1" characters, one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training combines segmentation losses on these probabilities with the standard causal LM loss. At inference, beam search generates the sequence and its logits, which serve as the predicted moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half of comparable methods -- our method achieves strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
https://arxiv.org/abs/2512.12246
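The "0"/"1" output format makes the segmentation objective straightforward to state: restrict the softmax to the two tokens to get a per-frame foreground probability, then apply standard soft Dice and BCE terms. The NumPy sketch below is illustrative only (the paper trains an MLLM end to end, not this standalone code, and the function names are ours).

```python
import numpy as np

def token_probs_to_fg(logits_0, logits_1):
    """Per-frame foreground probability from the LLM's "0"/"1" token logits
    (a two-way softmax restricted to just those two tokens)."""
    m = np.maximum(logits_0, logits_1)           # stabilize the exponentials
    e0, e1 = np.exp(logits_0 - m), np.exp(logits_1 - m)
    return e1 / (e0 + e1)

def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss between frame probabilities p and binary targets y."""
    inter = (p * y).sum()
    return 1.0 - (2 * inter + eps) / (p.sum() + y.sum() + eps)

def bce_loss(p, y, eps=1e-6):
    """Binary cross-entropy on the foreground probabilities."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

# toy example: 10 frames, ground-truth foreground mask y
y = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0, 0], dtype=float)
fg = token_probs_to_fg(np.where(y == 0, 5.0, -5.0),
                       np.where(y == 1, 5.0, -5.0))
loss = dice_loss(fg, y) + bce_loss(fg, y)  # small for confident, correct logits
```

Because these losses act on the same token logits the causal LM loss trains, they supply the frame-level gradient that plain text generation lacks, which is consistent with the abstract's observation that they keep improving the model after the LM loss plateaus.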