The efficacy of Artificial Intelligence (AI) in micro/nano manufacturing is fundamentally constrained by the scarcity of high-quality, physically grounded training data for defect inspection. Lithography defect data from the semiconductor industry are rarely accessible for research use, resulting in a shortage of publicly available datasets. To address this bottleneck in lithography, this study proposes a novel methodology for generating large-scale, physically valid defect datasets with pixel-level annotations. The framework begins with the ab initio synthesis of defect layouts using controllable, physics-constrained mathematical morphology operations (erosion and dilation) applied to the original design-level layout. These synthesized layouts, together with their defect-free counterparts, are fabricated into physical samples via high-fidelity digital micromirror device (DMD)-based lithography. Optical micrographs of the synthesized defect samples and their defect-free references are then compared to create consistent defect delineation annotations. Using this methodology, we constructed a comprehensive dataset of 3,530 optical micrographs containing 13,365 annotated defect instances spanning four classes: bridge, burr, pinch, and contamination. Each defect instance is annotated with a pixel-accurate segmentation mask, preserving full contour and geometry. The segmentation-based Mask R-CNN achieves AP@0.5 of 0.980, 0.965, and 0.971 on the bridge, burr, and pinch classes, compared with 0.740, 0.719, and 0.717 for Faster R-CNN, a mean AP@0.5 improvement of approximately 34%. For the contamination class, Mask R-CNN achieves an AP@0.5 roughly 42% higher than Faster R-CNN. These consistent gains demonstrate that the proposed methodology for generating defect datasets with pixel-level annotations is a feasible basis for robust AI-based Measurement/Inspection (MI) in semiconductor fabrication.
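The erosion- and dilation-based defect synthesis can be illustrated with a minimal pure-Python sketch. The binary layout, structuring element, and "bridge" construction below are toy assumptions for illustration; the paper applies physics-constrained morphology to real design-level layouts.

```python
def dilate(grid):
    """One step of binary dilation with a 3x3 square structuring element."""
    h, w = len(grid), len(grid[0])
    return [[1 if any(grid[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if 0 <= y + dy < h and 0 <= x + dx < w)
             else 0
             for x in range(w)] for y in range(h)]

def erode(grid):
    """One step of binary erosion with a 3x3 square structuring element."""
    h, w = len(grid), len(grid[0])
    return [[1 if all(grid[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if 0 <= y + dy < h and 0 <= x + dx < w)
             else 0
             for x in range(w)] for y in range(h)]

# Two parallel one-pixel lines: dilation merges them into a "bridge"-type
# defect, while erosion thins them away entirely (a "pinch"/open).
layout = [[0] * 7 for _ in range(5)]
for x in range(7):
    layout[1][x] = 1   # line A
    layout[3][x] = 1   # line B

defect = dilate(layout)
assert defect[2][3] == 1   # the gap between the lines is now bridged
```

In practice the operations would be applied only inside a localized window so that a single, controllable defect is injected rather than a global bias.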
https://arxiv.org/abs/2512.09001
A novel deep hybrid Residual-SwinCA-Net segmentation framework is proposed in this study to address these challenges by extracting locally correlated, robust features through residual CNN modules. To learn global dependencies, Swin Transformer blocks are customized with internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. Beforehand, a Laplacian-of-Gaussian regional operator is applied to enhance tissue continuity, suppress ultrasound noise, and accentuate fine structural transitions, and a boundary-oriented operator is incorporated to maintain the morphological integrity of malignant lesion contours. Subsequently, a stage-wise contraction strategy progressively reduces the feature maps to capture scale invariance and improve robustness to structural variability. In addition, each decoder level integrates, prior to augmentation, a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. MSCAS selectively emphasizes salient encoder maps and retains discriminative global context and complementary local structures at minimal computational cost while suppressing redundant activations. Finally, a Pixel-Attention module encodes class-relevant spatial cues by adaptively weighting malignant lesion pixels while suppressing background interference. Residual-SwinCA-Net and existing CNN/ViT techniques were implemented on the publicly available BUSI dataset. The proposed Residual-SwinCA-Net framework outperformed its counterparts, achieving 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation. The proposed framework improves BUSI lesion diagnostic performance and strengthens timely clinical decision-making.
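The Laplacian-of-Gaussian operator named above has a closed form; a minimal sketch of building a discrete LoG kernel follows. The kernel size and sigma are illustrative choices, not the paper's settings.

```python
import math

def log_kernel(size, sigma):
    """Discrete Laplacian-of-Gaussian kernel of shape size x size.

    LoG(x, y) = -1/(pi*sigma^4) * (1 - (x^2+y^2)/(2*sigma^2))
                * exp(-(x^2+y^2)/(2*sigma^2))
    """
    assert size % 2 == 1, "kernel size must be odd"
    r = size // 2
    kernel = []
    for y in range(-r, r + 1):
        row = []
        for x in range(-r, r + 1):
            s2 = (x * x + y * y) / (2.0 * sigma * sigma)
            row.append(-1.0 / (math.pi * sigma ** 4) * (1.0 - s2) * math.exp(-s2))
        kernel.append(row)
    return kernel

k = log_kernel(7, 1.0)
assert k[3][3] < 0          # strong negative center lobe
assert k[3][5] > 0          # positive surround ring at radius 2
```

Convolving an ultrasound frame with such a kernel responds strongly at intensity transitions of roughly sigma scale, which is what makes it useful for accentuating fine structural boundaries while averaging out speckle.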
https://arxiv.org/abs/2512.08243
Accurate three-dimensional delineation of liver tumors on contrast-enhanced CT is a prerequisite for treatment planning, navigation and response assessment, yet manual contouring is slow, observer-dependent and difficult to standardise across centres. Automatic segmentation is complicated by low lesion-parenchyma contrast, blurred or incomplete boundaries, heterogeneous enhancement patterns, and confounding structures such as vessels and adjacent organs. We propose a hybrid framework that couples an attention-enhanced cascaded U-Net with handcrafted radiomics and voxel-wise 3D CNN refinement for joint liver and liver-tumor segmentation. First, a 2.5D two-stage network with a densely connected encoder, sub-pixel convolution decoders and multi-scale attention gates produces initial liver and tumor probability maps from short stacks of axial slices. Inter-slice temporal consistency is then enforced by a simple three-slice refinement rule along the cranio-caudal direction, which restores thin and tiny lesions while suppressing isolated noise. Next, 728 radiomic descriptors spanning intensity, texture, shape, boundary and wavelet feature groups are extracted from candidate lesions and reduced to 20 stable, highly informative features via multi-strategy feature selection; a random forest classifier uses these features to reject false-positive regions. Finally, a compact 3D patch-based CNN derived from AlexNet operates in a narrow band around the tumor boundary to perform voxel-level relabelling and contour smoothing.
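The abstract does not spell out the three-slice refinement rule, so the following pure-Python sketch is one plausible minimal reading, stated here as an assumption: along the cranio-caudal axis, fill a voxel missing on one slice but present on both neighbours, and drop a voxel present on only one slice.

```python
def refine_slices(masks):
    """Three-slice consistency refinement along the cranio-caudal axis.

    `masks` is a list of per-slice binary voxel lists (flattened slices).
    Interior slices are corrected; the first and last slices are kept as-is.
    """
    out = [slice_[:] for slice_ in masks]
    for z in range(1, len(masks) - 1):
        for v in range(len(masks[z])):
            prev_, cur, next_ = masks[z - 1][v], masks[z][v], masks[z + 1][v]
            if prev_ and next_ and not cur:
                out[z][v] = 1    # restore a one-slice gap in a thin lesion
            elif cur and not prev_ and not next_:
                out[z][v] = 0    # suppress an isolated one-slice response
    return out

stack = [[1, 0, 0], [0, 0, 1], [1, 0, 0]]
refined = refine_slices(stack)
assert refined[1] == [1, 0, 0]   # gap filled, isolated voxel removed
```

The appeal of such a rule is exactly what the abstract claims: it restores thin, tiny lesions that a 2.5D network drops on a single slice, while removing responses with no cranio-caudal support.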
https://arxiv.org/abs/2512.07574
Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code are available at this https URL .
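Vanishing-point consistency is easy to state in homogeneous coordinates: image projections of 3D-parallel lines should meet at a single point, the cross product of any two of their line equations. A minimal sketch (the points are made up for illustration; it is not ControlVP's actual constraint implementation):

```python
def line_through(p, q):
    """Homogeneous line a*x + b*y + c = 0 through two image points."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)

def meet(l1, l2):
    """Intersection of two homogeneous lines; None if parallel in the image."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    x = b1 * c2 - c1 * b2
    y = c1 * a2 - a1 * c2
    w = a1 * b2 - b1 * a2
    return None if abs(w) < 1e-12 else (x / w, y / w)

# Two projected building edges that are parallel in 3D:
vp = meet(line_through((0, 0), (4, 2)), line_through((0, 4), (4, 2)))
assert vp == (4.0, 2.0)

# Consistency residual for a third edge that should share the point:
a, b, c = line_through((0, 1), (2, 1.5))
residual = abs(a * vp[0] + b * vp[1] + c) / (a * a + b * b) ** 0.5
```

A geometric constraint of the kind described in the abstract can be understood as penalizing exactly this kind of residual: the point-to-line distance between detected image edges and the lines through the expected vanishing point.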
https://arxiv.org/abs/2512.07504
A robust nonproliferation regime has contained the spread of nuclear weapons to just nine states. Yet, emerging and disruptive technologies are reshaping the landscape of nuclear risks, presenting a critical juncture for decision makers. This article lays out the contours of an overlooked but intensifying technological arms race for nuclear (in)visibility, driven by the interplay between proliferation-enabling technologies (PETs) and detection-enhancing technologies (DETs). We argue that the strategic pattern of proliferation will be increasingly shaped by the innovation pace in these domains. Artificial intelligence (AI) introduces unprecedented complexity to this equation, as its rapid scaling and knowledge substitution capabilities accelerate PET development and challenge traditional monitoring and verification methods. To analyze this dynamic, we develop a formal model centered on a Relative Advantage Index (RAI), quantifying the shifting balance between PETs and DETs. Our model explores how asymmetric technological advancement, particularly logistic AI-driven PET growth versus stepwise DET improvements, expands the band of uncertainty surrounding proliferation detectability. Through replicable scenario-based simulations, we evaluate the impact of varying PET growth rates and DET investment strategies on cumulative nuclear breakout risk. We identify a strategic fork ahead, where detection may no longer suffice without broader PET governance. Governments and international organizations should accordingly invest in policies and tools agile enough to keep pace with tomorrow's technology.
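The contrast between logistic, AI-accelerated PET growth and stepwise DET improvement can be sketched numerically. Every functional form and parameter below is an illustrative assumption, not the paper's calibration of the Relative Advantage Index.

```python
import math

def pet(t, cap=1.0, rate=1.2, midpoint=5.0):
    """Logistic PET capability curve (illustrative parameters)."""
    return cap / (1.0 + math.exp(-rate * (t - midpoint)))

def det(t, step_times=(3.0, 7.0), step_size=0.3):
    """Stepwise DET capability: discrete jumps at investment milestones."""
    return step_size * sum(1 for s in step_times if t >= s)

def rai(t):
    """Toy Relative Advantage Index: positive values favour the proliferator."""
    return pet(t) - det(t)

trajectory = [rai(t) for t in range(11)]
```

Even this toy version reproduces the qualitative point of the abstract: between DET investment steps, a smoothly compounding PET curve widens the proliferator's advantage, so the band of uncertainty around detectability grows unless DET steps arrive often enough.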
https://arxiv.org/abs/2512.07487
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody, without relying on phoneme-level alignment. Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module that derives melody representations directly from reference audio. To ensure robust melody encoding, we employ a teacher model to guide the optimization of the melody extractor, alongside an implicit alignment mechanism that enforces similarity distribution constraints for improved melodic stability and coherence. Additionally, we refine duration modeling using weakly annotated song data and introduce a Flow-GRPO reinforcement learning strategy with a multi-objective reward function to jointly enhance pronunciation clarity and melodic fidelity. Experiments show that our model achieves superior performance over existing approaches in both objective measures and subjective listening tests, especially in zero-shot and lyric adaptation settings, while maintaining high audio quality without manual annotation. This work offers a practical and scalable solution for advancing data-efficient singing voice synthesis. To support reproducibility, we release our inference code and model checkpoints.
https://arxiv.org/abs/2512.04779
Egyptian hieroglyphs, the ancient Egyptian writing system, are composed entirely of drawings. Translating these glyphs into English poses various challenges, including the fact that a single glyph can have multiple meanings. Deep learning translation applications are evolving rapidly, producing remarkable results that significantly impact our lives. In this research, we propose a method for the automatic recognition and translation of ancient Egyptian hieroglyphs from images to English. This study utilized two datasets for classification and translation: the Morris Franken dataset and the EgyptianTranslation dataset. Our approach is divided into three stages: segmentation (using Contour and Detectron2), mapping symbols to Gardiner codes, and translation (using the CNN model). The model achieved a BLEU score of 42.2, a significant result compared to previous research.
https://arxiv.org/abs/2512.03817
The architecture, engineering and construction (AEC) industry is constantly evolving to meet the demand for sustainable and effective design and construction of the built environment. In the literature, two primary deposition techniques for large-scale 3D concrete printing (3DCP) have been described, namely extrusion-based (Contour Crafting, CC) and shotcrete 3D printing (SC3DP) methods. Both deposition methods use a digitally controlled nozzle to print material layer by layer. The continuous flow of concrete material used to create the printed structure is called a filament or layer. As these filaments are the essential structure defining the printed object, quality control of their geometry is crucial. This paper presents an automated procedure for quality control (QC) of filaments in extrusion-based and SC3DP printing methods. The paper also describes a workflow that is independent of the sensor used for data acquisition, such as a camera, a structured light system (SLS) or a terrestrial laser scanner (TLS). The method can be applied to material in either the fresh or cured state, and can therefore be used for both online and post-printing QC.
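At its core, a sensor-agnostic filament QC pass reduces to comparing each layer's measured geometry against nominal filament dimensions. The toy sketch below bins sparse (x, z) cross-section points by layer and flags width deviations; a real pipeline would operate on dense point clouds from a camera, SLS, or TLS and check more than width.

```python
def layer_widths(points, layer_height, nominal, tol):
    """Bin cross-section points (x, z) by layer and flag width deviations.

    Returns {layer_index: (measured_width, within_tolerance)}.
    """
    layers = {}
    for x, z in points:
        layers.setdefault(int(z // layer_height), []).append(x)
    report = {}
    for idx, xs in sorted(layers.items()):
        width = max(xs) - min(xs)
        report[idx] = (width, abs(width - nominal) <= tol)
    return report

# Two layers of a printed wall, nominal filament width 10 mm, +-1 mm:
pts = [(0, 2), (10, 3), (5, 1), (0, 12), (13, 13), (6, 11)]
report = layer_widths(pts, layer_height=10, nominal=10, tol=1)
assert report[0] == (10, True)
assert report[1] == (13, False)   # over-extruded layer is flagged
```

Because the input is just a set of points with coordinates, the same check works online on fresh material or after curing, which is the sensor- and state-independence the paper emphasizes.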
https://arxiv.org/abs/2512.00091
Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel Physics-Informed Loss (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at this https URL.
https://arxiv.org/abs/2511.20501
Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective, without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
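The supervision target for a boundary branch can be derived from the mask itself. A common construction, not necessarily CrispFormer's exact edge-head target, keeps foreground pixels that have at least one background 4-neighbour:

```python
def boundary_map(mask):
    """Thin object contour: foreground pixels with a background 4-neighbour."""
    h, w = len(mask), len(mask[0])
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                        edge[y][x] = 1
                        break
    return edge

mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
edge = boundary_map(mask)
assert edge[2][2] == 0 and edge[1][1] == 1   # interior excluded, rim kept
```

A boundary-aware loss then typically up-weights the per-pixel loss on this map, concentrating gradient signal on the thin contours that weak cues localize worst.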
https://arxiv.org/abs/2511.19765
This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.
https://arxiv.org/abs/2511.19538
We present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of words' meanings. We used the generalized additive model to decompose a given observed pitch contour into a set of component pitch contours that are tied to different control variables and semantic predictors. Even when variables such as word duration, gender, speaker identity, tonal context, vowel height, and utterance position are controlled for, the effect of word remains a strong predictor of tonal realization. We present evidence that this effect of word is a semantic effect: word sense is shown to be a better predictor than word, and heterographic homophones are shown to have different pitch contours. The strongest evidence for the importance of semantics is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with an accuracy that substantially exceeds a permutation baseline. For phonetics, distributional semantics is a new kid on the block. Although our findings challenge standard theories of Mandarin tone, they fit well within the theoretical framework of the Discriminative Lexicon Model.
https://arxiv.org/abs/2511.17337
Due to the high cost of annotation or the rarity of some diseases, medical image segmentation is often limited by data scarcity and the resulting overfitting problem. Self-supervised learning and semi-supervised learning can mitigate the data scarcity challenge to some extent. However, both of these paradigms are complex and require either hand-crafted pretexts or well-defined pseudo-labels. In contrast, data augmentation represents a relatively simple and straightforward approach to addressing data scarcity issues. It has led to significant improvements in image recognition tasks. However, the effectiveness of local image editing augmentation techniques in the context of segmentation has been less explored. We propose HSMix, a novel approach to local image editing data augmentation involving hard and soft mixing for medical semantic segmentation. In our approach, a hard-augmented image is created by combining homogeneous regions (superpixels) from two source images. A soft mixing method further adjusts the brightness of these composed regions with brightness mixing based on locally aggregated pixel-wise saliency coefficients. The ground-truth segmentation masks of the two source images undergo the same mixing operations to generate the associated masks for the augmented images. Our method fully exploits both the prior contour and saliency information, thus preserving local semantic information in the augmented images while enriching the augmentation space with more diversity. Our method is a plug-and-play solution that is model agnostic and applicable to a range of medical imaging modalities. Extensive experimental evidence has demonstrated its effectiveness in a variety of medical segmentation tasks. The source code is available in this https URL.
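The hard-mixing step can be sketched in a few lines: copy one homogeneous region from image A into image B, and apply the identical operation to the two ground-truth masks so image and label stay consistent. The region below is a hand-made toy mask; HSMix obtains such regions as superpixels, and the saliency-based soft brightness mixing is omitted here.

```python
def hard_mix(src, dst, region):
    """Paste the pixels of `src` where `region` is 1, keep `dst` elsewhere."""
    return [[src[y][x] if region[y][x] else dst[y][x]
             for x in range(len(src[0]))] for y in range(len(src))]

img_a  = [[9, 9], [9, 9]]
img_b  = [[1, 1], [1, 1]]
lab_a  = [[1, 1], [0, 0]]
lab_b  = [[0, 0], [0, 0]]
region = [[1, 0], [0, 0]]   # one "superpixel" taken from image A

aug_img = hard_mix(img_a, img_b, region)
aug_lab = hard_mix(lab_a, lab_b, region)   # same operation on the masks
assert aug_img == [[9, 1], [1, 1]]
assert aug_lab == [[1, 0], [0, 0]]
```

Mixing along region boundaries rather than rectangular cut-outs is what lets the augmentation respect local semantic structure, which is the point the abstract makes about preserving prior contour information.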
https://arxiv.org/abs/2511.17614
Reliable 3D-2D alignment between intraoral scan (IOS) models and lateral cephalometric radiographs is critical for orthodontic diagnosis, yet conventional intensity-driven registration methods struggle under real clinical conditions, where cephalograms exhibit projective magnification, geometric distortion, low-contrast dental crowns, and acquisition-dependent variation. These factors hinder the stability of appearance-based similarity metrics and often lead to convergence failures or anatomically implausible alignments. To address these limitations, we propose DentalSCR, a pose-stable, contour-guided framework for accurate and interpretable silhouette-to-contour registration. Our method first constructs a U-Midline Dental Axis (UMDA) to establish a unified cross-arch anatomical coordinate system, thereby stabilizing initialization and standardizing projection geometry across cases. Using this reference frame, we generate radiograph-like projections via a surface-based DRR formulation with coronal-axis perspective and Gaussian splatting, which preserves clinical source-object-detector magnification and emphasizes external silhouettes. Registration is then formulated as a 2D similarity transform optimized with a symmetric bidirectional Chamfer distance under a hierarchical coarse-to-fine schedule, enabling both a large capture range and subpixel-level contour agreement. We evaluate DentalSCR on 34 expert-annotated clinical cases. Experimental results demonstrate substantial reductions in landmark error, particularly at the posterior teeth, tighter dispersion on the lower jaw, and low Chamfer and controlled Hausdorff distances at the curve level. These findings indicate that DentalSCR robustly handles real-world cephalograms and delivers high-fidelity, clinically inspectable 3D-2D alignment, outperforming conventional baselines.
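The registration objective named above, the symmetric bidirectional Chamfer distance, is compact enough to state directly. The point sets here are toys; the paper optimizes a 2D similarity transform over this distance in a coarse-to-fine schedule.

```python
def chamfer(a, b):
    """Symmetric bidirectional Chamfer distance between two 2D point sets."""
    def one_way(src, dst):
        # Average nearest-neighbour Euclidean distance from src to dst.
        return sum(min((sx - dx) ** 2 + (sy - dy) ** 2
                       for dx, dy in dst) ** 0.5
                   for sx, sy in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))

square  = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(x + 0.1, y) for x, y in square]
assert chamfer(square, square) == 0.0
assert abs(chamfer(square, shifted) - 0.1) < 1e-9
```

Averaging both directions is what makes the metric symmetric: a one-way Chamfer distance can be driven to zero by collapsing the moving contour onto a fragment of the fixed one, which the bidirectional form penalizes.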
https://arxiv.org/abs/2511.14343
Most neural network quantization methods apply uniform bit precision across spatial regions, ignoring the heterogeneous structural and textural complexity of visual data. This paper introduces MCAQ-YOLO, a morphological complexity-aware quantization framework for object detection. The framework employs five morphological metrics - fractal dimension, texture entropy, gradient variance, edge density, and contour complexity - to characterize local visual morphology and guide spatially adaptive bit allocation. By correlating these metrics with quantization sensitivity, MCAQ-YOLO dynamically adjusts bit precision according to spatial complexity. In addition, a curriculum-based quantization-aware training scheme progressively increases quantization difficulty to stabilize optimization and accelerate convergence. Experimental results demonstrate a strong correlation between morphological complexity and quantization sensitivity and show that MCAQ-YOLO achieves superior detection accuracy and convergence efficiency compared with uniform quantization. On a safety equipment dataset, MCAQ-YOLO attains 85.6 percent mAP@0.5 with an average of 4.2 bits and a 7.6x compression ratio, yielding 3.5 percentage points higher mAP than uniform 4-bit quantization while introducing only 1.8 ms of additional runtime overhead per image. Cross-dataset validation on COCO and Pascal VOC further confirms consistent performance gains, indicating that morphology-driven spatial quantization can enhance efficiency and robustness for computationally constrained, safety-critical visual recognition tasks.
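One of the five metrics, edge density, together with a toy complexity-to-bits rule, illustrates the spatially adaptive allocation idea. MCAQ-YOLO combines five morphological metrics and correlates them with measured quantization sensitivity; the linear mapping below is an assumption made purely for illustration.

```python
def edge_density(patch, threshold=1):
    """Fraction of pixels whose horizontal or vertical gradient exceeds threshold."""
    h, w = len(patch), len(patch[0])
    hits = 0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = abs(patch[y][x + 1] - patch[y][x])
            gy = abs(patch[y + 1][x] - patch[y][x])
            if max(gx, gy) > threshold:
                hits += 1
    return hits / ((h - 1) * (w - 1))

def allocate_bits(patch, min_bits=2, max_bits=8):
    """Map local morphological complexity to a quantization bit width."""
    return round(min_bits + edge_density(patch) * (max_bits - min_bits))

flat = [[5] * 4 for _ in range(4)]                              # uniform region
busy = [[(x + y) % 2 * 10 for x in range(4)] for y in range(4)]  # checkerboard
assert allocate_bits(flat) == 2   # smooth region tolerates coarse quantization
assert allocate_bits(busy) == 8   # textured region keeps full precision
```

Averaged over a feature map, such a rule yields fractional mean bit widths like the 4.2 bits the paper reports, with precision spent where gradients, texture, and contours concentrate.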
https://arxiv.org/abs/2511.12976
Brain tumors are among the most clinically significant neurological diseases and remain a major cause of morbidity and mortality due to their aggressive growth and structural heterogeneity. As tumors expand, they induce substantial anatomical deformation that disrupts both local tissue organization and global brain architecture, complicating diagnosis, treatment planning, and surgical navigation. Yet a subject-specific reference of how the brain would appear without tumor-induced changes is fundamentally unobtainable in clinical practice. We present BrainNormalizer, an anatomy-informed diffusion framework that reconstructs pseudo-healthy MRIs directly from tumorous scans by conditioning the generative process on boundary cues extracted from the subject's own anatomy. This boundary-guided conditioning enables anatomically plausible pseudo-healthy reconstruction without requiring paired non-tumorous and tumorous scans. BrainNormalizer employs a two-stage training strategy. The pretrained diffusion model is first adapted through inpainting-based fine-tuning on tumorous and non-tumorous scans. Next, an edge-map-guided ControlNet branch is trained to inject fine-grained anatomical contours into the frozen decoder while preserving learned priors. During inference, a deliberate misalignment strategy pairs tumorous inputs with non-tumorous prompts and mirrored contralateral edge maps, leveraging hemispheric correspondence to guide reconstruction. On the BraTS2020 dataset, BrainNormalizer achieves strong quantitative performance and qualitatively produces anatomically plausible reconstructions in tumor-affected regions while retaining overall structural coherence. BrainNormalizer provides clinically reliable anatomical references for treatment planning and supports new research directions in counterfactual modeling and tumor-induced deformation analysis.
https://arxiv.org/abs/2511.12853
Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.
https://arxiv.org/abs/2511.12547
Accurate segmentation and measurement of lithography scanning electron microscope (SEM) images are crucial for ensuring precise process control, optimizing device performance, and advancing semiconductor manufacturing yield. Lithography segmentation requires pixel-level delineation of groove contours and consistent performance across diverse pattern geometries and process windows. However, existing methods often lack the necessary precision and robustness, limiting their practical applicability. To overcome this challenge, we propose LithoSeg, a coarse-to-fine network tailored for lithography segmentation. In the coarse stage, we introduce a Human-in-the-Loop Bootstrapping scheme for the Segment Anything Model (SAM) to attain robustness with minimal supervision. In the subsequent fine stage, we recast 2D segmentation as a 1D regression problem by sampling groove-normal profiles using the coarse mask and performing point-wise refinement with a lightweight MLP. LithoSeg outperforms previous approaches in both segmentation accuracy and metrology precision while requiring less supervision, offering promising prospects for real-world applications.
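The "2D segmentation as 1D regression" recast operates on intensity profiles sampled along the groove normal. A classic analytic stand-in for the paper's learned MLP refinement is sub-pixel threshold crossing on such a profile (the profile values and threshold below are invented for illustration):

```python
def subpixel_edge(profile, level):
    """First sub-pixel crossing of `level` along a sampled intensity profile.

    Linear interpolation between the two samples that straddle the level;
    returns None if the profile never crosses it.
    """
    for i in range(len(profile) - 1):
        a, b = profile[i], profile[i + 1]
        if (a - level) * (b - level) < 0:
            return i + (level - a) / (b - a)
    return None

# Intensity sampled along a groove-normal line crossing a resist edge:
profile = [10, 12, 30, 80, 95]
pos = subpixel_edge(profile, 50.0)
assert abs(pos - 2.4) < 1e-12   # crossing between samples 2 and 3
```

Refining each sampled normal independently turns one hard 2D contour problem into many easy 1D ones, which is what makes sub-pixel metrology precision attainable from a merely coarse initial mask.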
https://arxiv.org/abs/2511.12005
Image segmentation is a core task in image processing, yet many methods degrade when images are heavily corrupted by noise and exhibit intensity inhomogeneity. Within the iterative-convolution thresholding method (ICTM) framework, we propose a variational segmentation model that integrates denoising terms. Specifically, the denoising component consists of an I-divergence term and an adaptive total-variation (TV) regularizer, making the model well suited to images contaminated by Gamma-distributed multiplicative noise and Poisson noise. A spatially adaptive weight derived from a gray-level indicator guides diffusion differently across regions of varying intensity. To further address intensity inhomogeneity, we estimate a smoothly varying bias field, which improves segmentation accuracy. Regions are represented by characteristic functions, with contour length encoded accordingly. For efficient optimization, we couple ICTM with a relaxed modified scalar auxiliary variable (RMSAV) scheme. Extensive experiments on synthetic and real-world images with intensity inhomogeneity and diverse noise types show that the proposed model achieves superior accuracy and robustness compared with competing approaches.
https://arxiv.org/abs/2511.08988
Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM) with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of the 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
https://arxiv.org/abs/2511.08007