Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet remain inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to their source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection accuracy and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.
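The arrowhead-to-source tracing idea can be sketched as follows. Everything here (the function name, the fixed-step walk, the bounding-box lookup) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def trace_edge_source(line_mask, arrow_xy, arrow_angle, node_boxes, max_steps=500):
    """Walk backward from an arrowhead along its orientation through a binary
    line mask until the path enters a detected node's bounding box.
    `arrow_angle` is the arrowhead's pointing direction in radians; stepping
    in the opposite direction leads back to the edge's source node.
    """
    step = np.array([-np.cos(arrow_angle), -np.sin(arrow_angle)])
    pos = np.asarray(arrow_xy, dtype=float)
    for _ in range(max_steps):
        pos = pos + step
        x, y = int(round(pos[0])), int(round(pos[1]))
        if not (0 <= y < line_mask.shape[0] and 0 <= x < line_mask.shape[1]):
            return None
        for node_id, (x0, y0, x1, y1) in node_boxes.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                return node_id          # reached a detected node
        if not line_mask[y, x]:
            return None                 # fell off the connecting line
    return None

# Toy diagram: a horizontal line from node A (left) to an arrowhead at (40, 10).
mask = np.zeros((20, 50), dtype=bool)
mask[10, 5:41] = True
boxes = {"A": (0, 5, 8, 15)}
print(trace_edge_source(mask, (40, 10), 0.0, boxes))  # the line traces back to "A"
```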
https://arxiv.org/abs/2604.06770
The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.
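The accuracy-weighted voting mechanism can be sketched in a few lines. The weighting scheme shown (each model's vote weighted by its individual validation accuracy) is one plausible reading of the abstract, not necessarily the paper's exact formula:

```python
import numpy as np

def weighted_vote(predictions, weights, n_classes):
    """Accuracy-weighted hard voting: each model's predicted class receives a
    score equal to that model's weight; the class with the highest total wins.
    """
    scores = np.zeros(n_classes)
    for pred, w in zip(predictions, weights):
        scores[pred] += w
    return int(np.argmax(scores))

# Three of five models (the more accurate ones) vote for class 2.
accs = [0.95, 0.93, 0.91, 0.70, 0.68]   # individual validation accuracies as weights
preds = [2, 2, 2, 0, 0]
print(weighted_vote(preds, accs, n_classes=3))  # -> 2
```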
https://arxiv.org/abs/2603.28357
Edge detection is a fundamental image analysis task that underpins numerous high-level vision applications. Recent advances in Transformer architectures have significantly improved edge quality by capturing long-range dependencies, but this often comes with computational overhead. Achieving higher pixel-level accuracy requires increased input resolution, further escalating computational cost and limiting practical deployment. Building on the strong representational capacity of recent Transformer-based edge detectors, we propose an Adaptive Multi-stage non-edge Pruning framework for Edge Detection (Amped). Amped identifies high-confidence non-edge tokens and removes them as early as possible, substantially cutting GFLOPs and accelerating inference with minimal loss in accuracy. Moreover, to mitigate the structural complexity of existing edge detection networks and facilitate their integration into real-world systems, we introduce a simple yet high-performance Transformer-based model, termed the Streamline Edge Detector (SED). Applied to both existing detectors and our SED, the proposed pruning strategy provides a favorable balance between accuracy and efficiency, reducing GFLOPs by up to 40% with only a 0.4% drop in ODS F-measure. In addition, despite its simplicity, SED achieves a state-of-the-art ODS F-measure of 86.5%. The code will be released.
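The staged token-pruning idea reduces to a gather over the tokens a small auxiliary head is confident are non-edge. A toy stand-in (the threshold and head are illustrative; Amped prunes progressively across multiple Transformer stages):

```python
import numpy as np

def prune_non_edge_tokens(tokens, edge_prob, keep_thresh=0.1):
    """Drop tokens whose predicted edge probability falls below keep_thresh,
    so later (more expensive) stages only process the surviving tokens.
    Returns the kept tokens and their original indices for later unpooling.
    """
    keep = edge_prob >= keep_thresh
    return tokens[keep], np.flatnonzero(keep)

tokens = np.random.randn(8, 16)          # 8 tokens, 16-dim features
probs = np.array([0.9, 0.02, 0.5, 0.01, 0.7, 0.03, 0.2, 0.05])
kept, idx = prune_non_edge_tokens(tokens, probs)
print(kept.shape, idx)  # (4, 16) [0 2 4 6]
```

Keeping the original indices is what lets a real implementation scatter the pruned tokens' (trivially non-edge) predictions back into the full-resolution output map.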
https://arxiv.org/abs/2603.27661
Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations.
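The confidence-gradient insight can be illustrated with a greedy toy version of progressive finalization; this is a much-simplified stand-in for MEMO's actual strategy, and the lateral-suppression scheme is an assumption:

```python
import numpy as np

def progressive_thinning(conf, n_edges, radius=1):
    """Commit the most confident pixel as an edge, then suppress its lateral
    neighbors in the same row (across the edge's thickness), so only the
    high-confidence center of a thick response survives.
    """
    conf = conf.copy()
    out = np.zeros(conf.shape, dtype=bool)
    for _ in range(n_edges):
        y, x = np.unravel_index(int(np.argmax(conf)), conf.shape)
        if conf[y, x] <= 0:
            break
        out[y, x] = True
        x0, x1 = max(0, x - radius), x + radius + 1
        conf[y, x0:x1] = 0.0          # suppress across the edge's thickness
    return out

# A 3-pixel-thick vertical edge: confidence is highest in the center column.
conf = np.zeros((5, 7))
conf[:, 2], conf[:, 3], conf[:, 4] = 0.4, 0.9, 0.4
thin = progressive_thinning(conf, n_edges=5)
print(thin[:, 3].all(), thin[:, 2].any(), thin[:, 4].any())  # True False False
```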
https://arxiv.org/abs/2603.20782
This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonation research. First, the efficacy of segmentation on generated data is shown with a prediction error of 2%. Next, 3D simulation data is used to establish performance of the graph-based workflow. The results of statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.
https://arxiv.org/abs/2603.16524
We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement - the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
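The exterior-contour step is classical Douglas-Peucker simplification. A minimal recursive version (the paper adds adaptive arc subdivision on top, which is omitted here):

```python
import numpy as np

def douglas_peucker(points, eps):
    """Recursive Douglas-Peucker polyline simplification: keep a chord's
    endpoints unless some point deviates from the chord by more than eps,
    in which case split at the farthest point and recurse on both halves.
    """
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    dx, dy = end - start
    seg_len = np.hypot(dx, dy)
    if seg_len == 0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # Perpendicular distance of each point to the chord start -> end.
        dists = np.abs(dx * (points[:, 1] - start[1])
                       - dy * (points[:, 0] - start[0])) / seg_len
    i = int(np.argmax(dists))
    if dists[i] > eps:
        left = douglas_peucker(points[: i + 1], eps)
        right = douglas_peucker(points[i:], eps)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

# A noisy straight contour collapses to its two endpoints...
line = np.array([[0, 0], [1, 0.05], [2, -0.04], [3, 0.02], [4, 0]])
print(douglas_peucker(line, eps=0.1))       # keeps only the endpoints
# ...while a genuine corner survives simplification.
corner = np.array([[0, 0], [2, 0.01], [4, 2]])
print(len(douglas_peucker(corner, eps=0.1)))  # 3
```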
https://arxiv.org/abs/2602.21153
Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose \MethodLPP, a lightweight (only ~21K additional parameters), plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, \MethodLPP performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating \MethodLPP substantially improves the performance of existing edge detection models. In particular, \MethodLPP increases the Average Crispness (AC) metric by up to 2-4x compared to baseline models. Under the crispness-emphasized evaluation (CEval), \MethodLPP further boosts baseline performance by up to 20-35% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at this https URL.
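The matching step can be illustrated with a greedy one-to-one assignment; the paper may use an optimal assignment instead, so treat this purely as a sketch of the idea (rank predictions by confidence, claim the nearest unmatched ground-truth pixel within a tolerance):

```python
import numpy as np

def greedy_edge_matching(pred_xy, pred_conf, gt_xy, max_dist=3.0):
    """One-to-one matching between predicted and ground-truth edge pixels:
    predictions are processed in decreasing confidence order and matched
    greedily to the nearest unclaimed ground-truth pixel within max_dist.
    Unmatched predictions would be supervised as false positives.
    """
    matched = {}
    taken = np.zeros(len(gt_xy), dtype=bool)
    order = np.argsort(-np.asarray(pred_conf))      # high confidence first
    for i in order:
        d = np.linalg.norm(gt_xy - pred_xy[i], axis=1)
        d[taken] = np.inf
        j = int(np.argmin(d))
        if d[j] <= max_dist:
            matched[int(i)] = j
            taken[j] = True
    return matched

pred = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0]])
conf = [0.9, 0.8, 0.7]
gt = np.array([[0.0, 0.0], [10.0, 10.0]])
print(greedy_edge_matching(pred, conf, gt))  # {0: 0, 2: 1}
```

The second prediction finds no free ground-truth pixel within range, so it goes unmatched, which is exactly the signal that discourages thick, duplicated edge responses.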
https://arxiv.org/abs/2602.20689
Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose **PRISM** (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.
https://arxiv.org/abs/2602.17785
We propose EasyControlEdge, adapting an image-generation foundation model to edge detection. In real-world edge detection (e.g., floor-plan walls, satellite roads/buildings, and medical organ boundaries), crispness and data efficiency are crucial, yet producing crisp raw edge maps with limited training samples remains challenging. Although image-generation foundation models perform well on many downstream tasks, their pretrained priors for data-efficient transfer and iterative refinement for high-frequency detail preservation remain underexploited for edge detection. To enable crisp and data-efficient edge detection using these capabilities, we introduce an edge-specialized adaptation of image-generation foundation models. To better specialize the foundation model for edge detection, we incorporate an edge-oriented objective with an efficient pixel-space loss. At inference, we introduce guidance based on unconditional dynamics, enabling a single model to control the edge density through a guidance scale. Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa compare against state-of-the-art methods and show consistent gains, particularly under no-post-processing crispness evaluation and with limited training data.
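The "guidance based on unconditional dynamics" is reminiscent of classifier-free guidance, where the conditional prediction is extrapolated away from the unconditional one. The form below is an assumption about the mechanism, not the paper's stated formula:

```python
import numpy as np

def guided_prediction(cond, uncond, scale):
    """Classifier-free-guidance-style combination: scale = 1 recovers the
    conditional prediction; larger scales push further from the unconditional
    dynamics, here controlling how dense/sharp the predicted edges are.
    """
    return uncond + scale * (cond - uncond)

cond = np.array([0.2, 0.8])    # model output with edge conditioning
uncond = np.array([0.1, 0.5])  # unconditional output
print(guided_prediction(cond, uncond, 1.0))  # scale 1 recovers cond
print(guided_prediction(cond, uncond, 2.0))  # [0.3 1.1]
```

A single trained model thus exposes edge density as an inference-time knob, which matches the abstract's claim that one model controls density through a guidance scale.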
https://arxiv.org/abs/2602.16238
Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities, an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at this https URL.
https://arxiv.org/abs/2602.09518
Deep learning models like U-Net and its variants have established state-of-the-art performance in edge detection tasks and are used worldwide by generative AI services in their image generation models. However, their decision-making processes remain opaque, operating as "black boxes" that obscure the rationale behind specific boundary predictions. This lack of transparency is a critical barrier in safety-critical applications where verification is mandatory. To bridge the gap between high-performance deep learning and interpretable logic, we propose the Rule-Based Spatial Mixture-of-Experts U-Net (sMoE U-Net). Our architecture introduces two key innovations: (1) Spatially-Adaptive Mixture-of-Experts (sMoE) blocks integrated into the decoder skip connections, which dynamically gate between "Context" (smooth) and "Boundary" (sharp) experts based on local feature statistics; and (2) a Takagi-Sugeno-Kang (TSK) Fuzzy Head that replaces the standard classification layer. This fuzzy head fuses deep semantic features with heuristic edge signals using explicit IF-THEN rules. We evaluate our method on the BSDS500 benchmark, achieving an Optimal Dataset Scale (ODS) F-score of 0.7628, effectively matching purely deep baselines like HED (0.7688) while outperforming the standard U-Net (0.7437). Crucially, our model provides pixel-level explainability through "Rule Firing Maps" and "Strategy Maps," allowing users to visualize whether an edge was detected due to strong gradients, high semantic confidence, or specific logical rule combinations.
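The per-pixel gating idea can be sketched with a hand-rolled statistic in place of the learned gate. The channel-variance statistic and sigmoid gate below are illustrative assumptions standing in for the paper's learned gating network:

```python
import numpy as np

def smoe_blend(feat, context_out, boundary_out):
    """Spatially-adaptive expert mixing: a per-pixel gate computed from local
    feature statistics blends a smooth "Context" expert with a sharp
    "Boundary" expert. Here the statistic is channel variance (a crude
    edge-ness proxy) squashed through a sigmoid.
    """
    stat = feat.var(axis=0)                             # (H, W) per-pixel statistic
    gate = 1.0 / (1.0 + np.exp(-(stat - stat.mean())))  # sigmoid gate in (0, 1)
    return gate * boundary_out + (1.0 - gate) * context_out

rng = np.random.default_rng(1)
feat = rng.normal(size=(16, 8, 8))        # C x H x W decoder feature map
ctx = np.zeros((8, 8))                    # toy "Context" expert output
bnd = np.ones((8, 8))                     # toy "Boundary" expert output
out = smoe_blend(feat, ctx, bnd)
# Every pixel is a strict convex combination of the two expert outputs.
print(out.shape, out.min() > 0.0, out.max() < 1.0)
```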
https://arxiv.org/abs/2602.05100
Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state estimation drift; for instance, LiDAR-based methods do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity; specifically, some previous works using virtual obstacles are implemented case-by-case. In this work, we present \textit{Hiking in the Wild}, a scalable, end-to-end parkour perceptive framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable \textit{Terrain Edge Detection} with \textit{Foot Volume Points} to prevent catastrophic slippage on edges, and a \textit{Flat Patch Sampling} strategy that mitigates reward hacking by generating feasible navigation targets. Our approach utilizes a single-stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.
https://arxiv.org/abs/2601.07718
Embedded vision systems need efficient and robust image processing algorithms to perform in real time on resource-constrained hardware. This research investigates image processing algorithms, specifically edge detection, corner detection, and blob detection, implemented on embedded processors, including DSPs and FPGAs. To address the latency, accuracy, and power-consumption issues noted in the image processing literature, optimized algorithm architectures and quantization techniques are employed. In addition, techniques for inter-frame redundancy removal and adaptive frame averaging are used to improve throughput at reasonable image quality. Simulations and hardware trials of the proposed approaches show marked improvements in processing speed and energy efficiency as compared to conventional implementations. The advances of this research facilitate a path for scalable and inexpensive embedded imaging systems for the automotive, surveillance, and robotics sectors, and underscore the benefit of co-designing algorithms and hardware architectures for practical real-time embedded vision applications.
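Adaptive frame averaging can be sketched as a per-pixel exponential average whose blend factor depends on inter-frame change; the thresholds and blend factors below are illustrative, not values from the paper:

```python
import numpy as np

def adaptive_frame_average(prev_avg, frame, motion_thresh=10.0,
                           alpha_static=0.9, alpha_moving=0.2):
    """Inter-frame redundancy removal via adaptive averaging: pixels that
    barely changed lean heavily on the running average (noise suppression),
    while changed regions follow the new frame (no motion ghosting).
    """
    diff = np.abs(frame - prev_avg)
    alpha = np.where(diff < motion_thresh, alpha_static, alpha_moving)
    return alpha * prev_avg + (1 - alpha) * frame

prev = np.full((4, 4), 100.0)
frame = prev.copy()
frame[0, 0] = 200.0                       # one moving pixel
out = adaptive_frame_average(prev, frame)
print(out[0, 0], out[1, 1])               # 180.0 100.0
```

On hardware, the same update reduces to a multiply-accumulate per pixel with a two-entry alpha lookup, which is why it maps cheaply onto DSPs and FPGAs.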
https://arxiv.org/abs/2601.06243
Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
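The multi-scale pipeline (1-D transition signal, Gaussian smoothing at growing scales, statistical thresholding) can be sketched as follows. The mean-plus-std threshold and the exact scales are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def detect_column_edges(mask, sigmas=(1.0, 2.0)):
    """Project the table mask onto the x-axis, model column transitions as a
    one-dimensional signal, smooth at progressively larger scales, and keep
    only peaks that clear a statistical threshold (mean + std) at every
    scale, i.e., the stable structural edges.
    """
    signal = np.abs(np.diff(mask.sum(axis=0).astype(float)))  # column transitions
    stable = np.ones(signal.shape, dtype=bool)
    for s in sigmas:
        smooth = np.convolve(signal, gaussian_kernel(s), mode="same")
        stable &= smooth > smooth.mean() + smooth.std()
    return np.flatnonzero(stable)

# Toy mask: a filled 10x30 table with empty separator columns at x=10 and x=20.
mask = np.ones((10, 30), dtype=int)
mask[:, 10] = 0
mask[:, 20] = 0
print(detect_column_edges(mask))  # the transitions flanking the two separators
```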
https://arxiv.org/abs/2512.21287
With the rapidly growing population of resident space objects (RSOs) in the near-Earth space environment, detailed information about their condition and capabilities is needed to provide Space Domain Awareness (SDA). Space-based sensing will enable inspection of RSOs at shorter ranges, independent of atmospheric effects, and from all aspects. The use of a sub-THz inverse synthetic aperture radar (ISAR) imaging and sensing system for SDA has been proposed in previous work, demonstrating the achievement of sub-cm image resolution at ranges of up to 100 km. This work focuses on recognition of external structures by use of sequential feature detection and tracking throughout the aligned ISAR images of the satellites. The Hough transform is employed to detect linear features, which are tracked throughout the sequence. ISAR imagery is generated via a metaheuristic simulator capable of modelling encounters for a variety of deployment scenarios. Initial frame-to-frame alignment is achieved through a series of affine transformations to facilitate later association between image features. A gradient-by-ratio method is used for edge detection within individual ISAR images, and edge magnitude and direction are subsequently used to inform a double-weighted Hough transform to detect features with high accuracy. Feature evolution during sequences of frames is analysed. It is shown that the use of feature tracking within sequences with the proposed approach will increase confidence in feature detection and classification, and an example use-case of robust detection of shadowing as a feature is presented.
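The "double-weighted" Hough idea can be sketched as an accumulator where each edge pixel's vote is weighted both by its edge magnitude and by the agreement between its gradient direction and the candidate line's normal. The weighting form below is an assumption about the scheme, not the paper's exact formulation:

```python
import numpy as np

def weighted_hough(points, magnitudes, directions, n_theta=180,
                   n_rho=64, rho_max=50.0):
    """Hough-transform accumulation over (theta, rho) line parameters where a
    vote's weight is edge_magnitude * |cos(theta - gradient_direction)|, so
    pixels whose gradient aligns with a line's normal dominate its bin.
    """
    acc = np.zeros((n_theta, n_rho))
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    for (x, y), mag, d in zip(points, magnitudes, directions):
        for ti, th in enumerate(thetas):
            rho = x * np.cos(th) + y * np.sin(th)
            ri = int((rho + rho_max) / (2 * rho_max) * n_rho)
            if 0 <= ri < n_rho:
                align = abs(np.cos(th - d))       # direction-agreement weight
                acc[ti, ri] += mag * align        # magnitude weight
    return acc, thetas

# A vertical line x = 10: its gradient (line normal) points along the x-axis.
pts = [(10, y) for y in range(20)]
acc, thetas = weighted_hough(pts, [1.0] * 20, [0.0] * 20)
ti, ri = np.unravel_index(acc.argmax(), acc.shape)
print(round(float(thetas[ti]), 3))  # 0.0, i.e., the detected normal is along x
```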
https://arxiv.org/abs/2512.15618
Underwater pipelines are highly susceptible to corrosion, which not only shortens their service life but also poses significant safety risks. Compared with manual inspection, an intelligent real-time imaging system for underwater pipeline detection is a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can recover sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a fast distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the numerous disturbances present in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed, which integrates a pipeline edge detection network with enhanced point cloud registration to achieve robust, high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability, and robustness, providing a solid foundation for autonomous underwater pipeline detection.
https://arxiv.org/abs/2512.11354
We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data's thin and know-how matters more than raw compute.
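The kind of hand-crafted feature vector described can be sketched as a Sobel-magnitude histogram (wear softens a coin's edges) stitched to HSV statistics (toning and color). This is an illustrative subset, not the paper's actual 192 features:

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude via 3x3 Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = gray[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

def coin_features(gray, hsv, n_bins=8):
    """Hand-crafted feature vector: a normalized histogram of Sobel edge
    magnitudes concatenated with per-channel HSV means and standard
    deviations, giving a small, interpretable input for a feature-based ANN.
    """
    mag = sobel_magnitude(gray)
    hist, _ = np.histogram(mag, bins=n_bins,
                           range=(0, mag.max() + 1e-9), density=True)
    hsv_stats = np.concatenate([hsv.mean(axis=(0, 1)), hsv.std(axis=(0, 1))])
    return np.concatenate([hist, hsv_stats])

gray = np.random.rand(16, 16)
hsv = np.random.rand(16, 16, 3)
print(coin_features(gray, hsv).shape)  # (14,) = 8 edge bins + 6 HSV stats
```

With under 2,000 labeled coins, a 14- (or 192-) dimensional expert-designed input gives an ANN far less to learn than raw pixels give a CNN, which is the paper's central point.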
https://arxiv.org/abs/2512.04464
Medical image registration is crucial for various clinical and research applications including disease diagnosis or treatment planning which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.
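The initialization scheme from the abstract (a predefined edge kernel perturbed with random noise, then refined by training) can be sketched as follows; the Sobel choice, kernel count, and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_learnable_edge_kernels(n_kernels=8, noise_std=0.1):
    """Build initial weights for a bank of learnable edge kernels: each is a
    copy of a predefined edge detection kernel (Sobel here) perturbed with
    Gaussian noise. Training would then adapt each copy toward the edge
    features most useful for registration.
    """
    sobel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    return np.stack([sobel + rng.normal(0.0, noise_std, sobel.shape)
                     for _ in range(n_kernels)])

kernels = make_learnable_edge_kernels()
print(kernels.shape)  # (8, 3, 3)
# Each kernel starts near Sobel, but the noise breaks symmetry so that the
# copies receive different gradients and can specialize during training.
```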
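The initialization scheme the abstract describes — a predefined edge kernel perturbed with random noise, then treated as trainable weights — can be sketched as follows (the kernel choice, noise scale, and function name are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Predefined edge-detection prior: the Sobel-x kernel.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

def init_learnable_edge_kernels(base, n_kernels, noise_std=0.1):
    """Replicate a base edge kernel and perturb each copy with Gaussian
    noise. The copies would then serve as the initial weights of a
    trainable convolution layer, so training starts from an edge-detecting
    bias rather than from a fully random initialization."""
    return np.stack([base + rng.normal(0.0, noise_std, base.shape)
                     for _ in range(n_kernels)])

kernels = init_learnable_edge_kernels(sobel_x, n_kernels=8)
print(kernels.shape)  # (8, 3, 3): eight noisy variants of the edge prior
```

In a deep-learning framework these arrays would simply be assigned to a convolution layer's weight tensor before training, biasing the learned features toward edge structure from the first epoch.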
https://arxiv.org/abs/2512.01771
Visual odometry techniques typically rely on feature extraction from a sequence of images and subsequent computation of optical flow. This point-to-point correspondence between consecutive frames can be costly to compute and of varying accuracy, which degrades the quality of the odometry estimate. Attempts to bypass the difficulties of the correspondence problem have adopted line features and fused other sensors (event cameras, IMUs) to improve performance, yet many still rely heavily on correspondence. If the camera observes a straight line as it moves, the image of the line sweeps a smooth surface in image space-time. This is a ruled surface, and analyzing its shape yields information about odometry. Further, its estimation requires only differentially computed updates from point-to-line associations. Inspired by event cameras' propensity for edge detection, this research presents a novel algorithm to reconstruct 3D scenes and visual odometry from these ruled surfaces. Constraining the surfaces with inertial measurements from an onboard IMU greatly reduces the dimensionality of the solution space.
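The ruled-surface idea admits a compact parametrization (the notation below is generic, not taken from the paper): at each time $t$ the projected line is itself a line in the image, so the swept surface is

```latex
% Ruled surface swept in image space-time by the projected line:
% c(t) is a point on the image line at time t, d(t) its direction,
% and s parametrizes points along the line.
S(t, s) = c(t) + s\, d(t), \qquad t \in [t_0, t_1], \quad s \in \mathbb{R}.
```

Every constant-$t$ slice of $S$ is a straight ruling, which is what makes the surface's shape analyzable from point-to-line associations alone, without point-to-point correspondence.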
https://arxiv.org/abs/2512.00327
The MARWIN robot operates at the European XFEL to perform autonomous radiation monitoring in long, monotonous accelerator tunnels where conventional localization approaches struggle. Its current navigation concept combines lidar-based edge detection, wheel/lidar odometry with periodic QR-code referencing, and fuzzy control of wall distance, rotation, and longitudinal position. While robust in predefined sections, this design lacks flexibility for unknown geometries and obstacles. This paper explores deep visual stereo odometry (DVSO) with 3D-geometric constraints as a focused alternative. DVSO is purely vision-based, leveraging stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion without labeled data. For global consistency, DVSO can subsequently be fused with absolute references (e.g., landmarks) or other sensors. We provide a conceptual evaluation for accelerator tunnel environments, using the European XFEL as a case study. Expected benefits include reduced scale drift via stereo, low-cost sensing, and scalable data collection, while challenges remain in low-texture surfaces, lighting variability, computational load, and robustness under radiation. The paper defines a research agenda toward enabling MARWIN to navigate more autonomously in constrained, safety-critical infrastructures.
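The "reduced scale drift via stereo" benefit comes from the stereo baseline anchoring metric depth, which monocular odometry cannot recover. A minimal illustration using the standard pinhole stereo relation (the numbers are made up, not MARWIN's camera parameters):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo: depth Z = f * B / d. A known baseline B fixes
    metric scale, which is the stereo cue DVSO exploits against drift."""
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 240 px focal length, 0.5 m baseline, 40 px disparity.
print(depth_from_disparity(40.0, 240.0, 0.5))  # 3.0 metres
```

Because depth scales inversely with disparity, disparity errors matter most for distant, low-texture tunnel walls — one of the challenges the paper lists.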
https://arxiv.org/abs/2512.00080