Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities, an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at this https URL.
https://arxiv.org/abs/2602.09518
Deep learning models such as U-Net and its variants have established state-of-the-art performance in edge detection tasks and are used by Generative AI services worldwide for their image generation models. However, their decision-making processes remain opaque, operating as "black boxes" that obscure the rationale behind specific boundary predictions. This lack of transparency is a critical barrier in safety-critical applications where verification is mandatory. To bridge the gap between high-performance deep learning and interpretable logic, we propose the Rule-Based Spatial Mixture-of-Experts U-Net (sMoE U-Net). Our architecture introduces two key innovations: (1) Spatially-Adaptive Mixture-of-Experts (sMoE) blocks integrated into the decoder skip connections, which dynamically gate between "Context" (smooth) and "Boundary" (sharp) experts based on local feature statistics; and (2) a Takagi-Sugeno-Kang (TSK) Fuzzy Head that replaces the standard classification layer. This fuzzy head fuses deep semantic features with heuristic edge signals using explicit IF-THEN rules. We evaluate our method on the BSDS500 benchmark, achieving an Optimal Dataset Scale (ODS) F-score of 0.7628, effectively matching purely deep baselines like HED (0.7688) while outperforming the standard U-Net (0.7437). Crucially, our model provides pixel-level explainability through "Rule Firing Maps" and "Strategy Maps," allowing users to visualize whether an edge was detected due to strong gradients, high semantic confidence, or specific logical rule combinations.
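To make the IF-THEN fusion concrete, the following is a minimal per-pixel sketch of a first-order TSK fuzzy head that combines a heuristic gradient signal with a deep semantic confidence. The membership shapes, rule set, and consequent coefficients are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def gauss_mf(x, c, s):
    """Gaussian membership function centred at c with width s."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def tsk_edge_score(grad_mag, sem_conf):
    """Fuse per-pixel gradient magnitude and semantic confidence (both in [0, 1])."""
    # Rule firing strengths (product t-norm of antecedent memberships).
    w1 = gauss_mf(grad_mag, 1.0, 0.3) * gauss_mf(sem_conf, 1.0, 0.5)  # IF grad HIGH AND sem HIGH
    w2 = gauss_mf(grad_mag, 1.0, 0.3) * gauss_mf(sem_conf, 0.0, 0.5)  # IF grad HIGH AND sem LOW
    w3 = gauss_mf(grad_mag, 0.0, 0.3)                                 # IF grad LOW
    # First-order TSK consequents: linear functions of the inputs.
    y1 = 0.6 * grad_mag + 0.4 * sem_conf   # strong combined edge evidence
    y2 = 0.8 * grad_mag                    # trust the gradient alone
    y3 = 0.2 * sem_conf                    # weak evidence, lean on semantics
    # Defuzzify: firing-strength-weighted average of rule outputs.
    w = w1 + w2 + w3
    return (w1 * y1 + w2 * y2 + w3 * y3) / w

score_edge = tsk_edge_score(0.9, 0.9)   # strong gradient + confident semantics
score_flat = tsk_edge_score(0.1, 0.2)   # flat region
```

The per-rule firing strengths (w1, w2, w3) are exactly what a "Rule Firing Map" would visualize at each pixel.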
https://arxiv.org/abs/2602.05100
Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state estimation drift; for instance, LiDAR-based methods do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity; specifically, some previous works using virtual obstacles are implemented case-by-case. In this work, we present \textit{Hiking in the Wild}, a scalable, end-to-end parkour perceptive framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable \textit{Terrain Edge Detection} with \textit{Foot Volume Points} to prevent catastrophic slippage on edges, and a \textit{Flat Patch Sampling} strategy that mitigates reward hacking by generating feasible navigation targets. Our approach utilizes a single-stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.
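The terrain-edge idea can be illustrated on a robot-centric heightmap: cells whose height jumps sharply relative to a neighbour are flagged as edges that footholds should avoid. This is a hedged sketch; the grid, the height-jump threshold, and the step geometry below are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def terrain_edges(heightmap, jump_thresh=0.15):
    """heightmap: (H, W) heights in metres -> boolean mask of edge cells."""
    h = heightmap
    # Absolute height differences to the left/upper neighbour (first row/col padded).
    dx = np.abs(np.diff(h, axis=1, prepend=h[:, :1]))
    dy = np.abs(np.diff(h, axis=0, prepend=h[:1, :]))
    return (dx > jump_thresh) | (dy > jump_thresh)

# Toy heightmap: flat ground with a 0.3 m step, e.g. a stair or ledge edge.
hm = np.zeros((10, 10))
hm[:, 5:] = 0.3
mask = terrain_edges(hm)
```

A foothold-safety mechanism would then penalize foot placements (or "foot volume points") that overlap the masked cells.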
https://arxiv.org/abs/2601.07718
Embedded vision systems need efficient and robust image processing algorithms to perform in real time on resource-constrained hardware. This research investigates image processing algorithms, specifically edge detection, corner detection, and blob detection, implemented on embedded processors, including DSPs and FPGAs. To address the latency, accuracy, and power-consumption issues noted in the image processing literature, optimized algorithm architectures and quantization techniques are employed. In addition, optimized techniques for inter-frame redundancy removal and adaptive frame averaging are used to improve throughput with reasonable image quality. Simulations and hardware trials of the proposed approaches show marked improvements in processing speed and energy efficiency as compared to conventional implementations. The advances of this research facilitate a path for scalable and inexpensive embedded imaging systems for the automotive, surveillance, and robotics sectors, and underscore the benefit of co-designing algorithms and hardware architectures for practical real-time embedded vision applications.
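The two throughput optimisations mentioned above can be sketched simply: inter-frame redundancy removal skips frames nearly identical to the last kept frame, and adaptive frame averaging averages more frames when the scene is static to suppress noise. The threshold and window values here are illustrative assumptions.

```python
import numpy as np

def select_frames(frames, diff_thresh=4.0):
    """Keep only frames whose mean absolute difference from the last kept
    frame exceeds diff_thresh (8-bit grey frames as 2-D arrays)."""
    kept = [frames[0]]
    for f in frames[1:]:
        # Cast to int16 so the uint8 subtraction cannot wrap around.
        if np.mean(np.abs(f.astype(np.int16) - kept[-1].astype(np.int16))) > diff_thresh:
            kept.append(f)
    return kept

def adaptive_average(frames, motion, max_window=4):
    """Average up to max_window recent frames; shrink the window when
    motion (a scalar activity estimate in [0, 1]) is high."""
    window = max(1, int(round(max_window * (1.0 - motion))))
    return np.mean(np.stack(frames[-window:]), axis=0)

static = [np.full((4, 4), 100, np.uint8)] * 5
moving = static + [np.full((4, 4), 160, np.uint8)]
n_static = len(select_frames(static))   # redundant frames are dropped
n_moving = len(select_frames(moving))   # the changed frame is kept
avg = adaptive_average(static, motion=0.0)
```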
https://arxiv.org/abs/2601.06243
Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
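The multi-scale 1-D signal idea can be sketched as follows: column transitions of a binary table mask are projected onto a 1-D signal, smoothed with Gaussian kernels of increasing variance, and peaks above a statistical threshold are kept as column boundaries. The kernel sizes, the mean-plus-one-standard-deviation threshold, and the toy mask are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def column_boundaries(mask, sigmas=(1.0, 2.0, 3.0)):
    """mask: 2-D {0,1} array -> candidate column-boundary x coordinates."""
    profile = mask.sum(axis=0).astype(float)
    signal = np.abs(np.diff(profile))            # transition strength per column
    for s in sigmas:                             # progressively coarser smoothing
        signal = np.convolve(signal, gaussian_kernel(s), mode="same")
    thresh = signal.mean() + signal.std()        # statistical noise suppression
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > thresh
            and signal[i] >= signal[i - 1] and signal[i] >= signal[i + 1]]

# Toy mask: two table columns separated by a gap around x = 25..30.
mask = np.zeros((20, 60), int)
mask[:, 5:25] = 1
mask[:, 30:55] = 1
cols = column_boundaries(mask)
```

The surviving peak indices are exactly the positions that would be mapped back to image coordinates as segment boundaries.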
https://arxiv.org/abs/2512.21287
With the rapidly growing population of resident space objects (RSOs) in the near-Earth space environment, detailed information about their condition and capabilities is needed to provide Space Domain Awareness (SDA). Space-based sensing will enable inspection of RSOs at shorter ranges, independent of atmospheric effects, and from all aspects. The use of a sub-THz inverse synthetic aperture radar (ISAR) imaging and sensing system for SDA has been proposed in previous work, demonstrating the achievement of sub-cm image resolution at ranges of up to 100 km. This work focuses on recognition of external structures by use of sequential feature detection and tracking throughout the aligned ISAR images of the satellites. The Hough transform is employed to detect linear features, which are tracked throughout the sequence. ISAR imagery is generated via a metaheuristic simulator capable of modelling encounters for a variety of deployment scenarios. Initial frame-to-frame alignment is achieved through a series of affine transformations to facilitate later association between image features. A gradient-by-ratio method is used for edge detection within individual ISAR images, and edge magnitude and direction are subsequently used to inform a double-weighted Hough transform to detect features with high accuracy. Feature evolution during sequences of frames is analysed. It is shown that the use of feature tracking within sequences with the proposed approach will increase confidence in feature detection and classification, and an example use-case of robust detection of shadowing as a feature is presented.
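A magnitude-weighted Hough transform in the spirit of the approach above can be sketched as follows: each edge pixel votes in rho-theta space with a weight given by its edge magnitude (the paper's double-weighted variant additionally weights by edge direction, which is omitted here). The toy edge map and resolution are assumptions.

```python
import numpy as np

def weighted_hough(edge_mag, n_theta=180):
    """Accumulate rho-theta votes weighted by edge magnitude (one bin per degree)."""
    h, w = edge_mag.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))
    acc = np.zeros((2 * diag, n_theta))
    ys, xs = np.nonzero(edge_mag)
    for y, x in zip(ys, xs):
        # rho = x cos(theta) + y sin(theta), shifted so indices are non-negative.
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int) + diag
        acc[rhos, np.arange(n_theta)] += edge_mag[y, x]
    return acc, diag

# Toy edge map: a horizontal line at y = 10 with strong magnitude.
img = np.zeros((32, 32))
img[10, 4:28] = 5.0
acc, diag = weighted_hough(img)
rho_idx, theta_idx = np.unravel_index(np.argmax(acc), acc.shape)
theta_deg = theta_idx            # one accumulator column per degree
rho = rho_idx - diag             # horizontal line -> theta near 90, rho near 10
```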
https://arxiv.org/abs/2512.15618
Underwater pipelines are highly susceptible to corrosion, which not only shortens their service life but also poses significant safety risks. Compared with manual inspection, an intelligent real-time imaging system for underwater pipeline detection is a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can restore sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a fast distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the presence of numerous disturbances in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed, which integrates a pipeline edge detection network with enhanced point cloud registration to achieve robust and high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability, and robustness, providing a solid foundation for autonomous underwater pipeline detection.
https://arxiv.org/abs/2512.11354
We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data's thin and know-how matters more than raw compute.
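To give a flavor of the hand-crafted feature idea, here is a hedged sketch that pulls Sobel edge statistics and HSV-style colour statistics from a coin image and concatenates them into a small feature vector. It does not reproduce the paper's 192 features; the specific statistics chosen are assumptions.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)

def conv2d(img, k):
    """Valid-mode 3x3 convolution (correlation), plain loops for clarity."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def coin_features(rgb):
    """rgb: (H, W, 3) float array in [0, 1] -> 1-D feature vector."""
    grey = rgb.mean(axis=2)
    gx = conv2d(grey, SOBEL_X)
    gy = conv2d(grey, SOBEL_X.T)
    mag = np.hypot(gx, gy)                       # Sobel edge magnitude
    # HSV value and saturation computed directly from RGB.
    v = rgb.max(axis=2)
    c = v - rgb.min(axis=2)
    s = np.where(v > 0, c / np.maximum(v, 1e-9), 0.0)
    return np.array([mag.mean(), mag.std(), np.percentile(mag, 90),
                     v.mean(), v.std(), s.mean(), s.std()])

rgb = np.random.default_rng(0).random((32, 32, 3))
f = coin_features(rgb)
```

A feature vector like this (only much larger) is what the feature-based ANN consumes, in contrast to the CNN's raw pixels.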
https://arxiv.org/abs/2512.04464
Medical image registration is crucial for various clinical and research applications including disease diagnosis or treatment planning which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.
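The kernel-initialisation idea described above can be sketched as follows: start from a predefined edge-detection kernel (Sobel is used here as a plausible choice), perturb it with small random noise, and treat the result as the initial weights of a learnable convolution. The noise scale and kernel count are illustrative assumptions, and the training loop that adapts the kernels is omitted.

```python
import numpy as np

SOBEL_X = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])

def init_edge_kernels(n_kernels=8, noise_scale=0.1, seed=0):
    """Return n_kernels perturbed copies of the Sobel kernel and its transpose,
    to be used as initial weights of a learnable convolution layer."""
    rng = np.random.default_rng(seed)
    bases = [SOBEL_X, SOBEL_X.T]                 # horizontal and vertical priors
    kernels = [bases[i % 2] + noise_scale * rng.standard_normal((3, 3))
               for i in range(n_kernels)]
    return np.stack(kernels)

kernels = init_edge_kernels()
# Each kernel starts close to its edge-detecting prior rather than at random.
dev = np.abs(kernels[0] - SOBEL_X).max()
```

In contrast to a randomly initialised layer, every kernel here begins with an explicit edge-detecting bias that training can then refine.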
https://arxiv.org/abs/2512.01771
Visual odometry techniques typically rely on feature extraction from a sequence of images and subsequent computation of optical flow. This point-to-point correspondence between two consecutive frames can be costly to compute and suffers from varying accuracy, which affects the quality of the odometry estimate. Attempts have been made to bypass the difficulties originating from the correspondence problem by adopting line features and fusing other sensors (event camera, IMU) to improve performance, though many such methods still rely heavily on correspondence. If the camera observes a straight line as it moves, the image of the line sweeps out a smooth surface in image space over time. This is a ruled surface, and analyzing its shape gives information about odometry. Further, its estimation requires only differentially computed updates from point-to-line associations. Inspired by event cameras' propensity for edge detection, this research presents a novel algorithm to reconstruct 3D scenes and estimate visual odometry from these ruled surfaces. By constraining the surfaces with inertial measurements from an onboard IMU sensor, the dimensionality of the solution space is greatly reduced.
https://arxiv.org/abs/2512.00327
The MARWIN robot operates at the European XFEL to perform autonomous radiation monitoring in long, monotonous accelerator tunnels where conventional localization approaches struggle. Its current navigation concept combines lidar-based edge detection, wheel/lidar odometry with periodic QR-code referencing, and fuzzy control of wall distance, rotation, and longitudinal position. While robust in predefined sections, this design lacks flexibility for unknown geometries and obstacles. This paper explores deep visual stereo odometry (DVSO) with 3D-geometric constraints as a focused alternative. DVSO is purely vision-based, leveraging stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion without labeled data. For global consistency, DVSO can subsequently be fused with absolute references (e.g., landmarks) or other sensors. We provide a conceptual evaluation for accelerator tunnel environments, using the European XFEL as a case study. Expected benefits include reduced scale drift via stereo, low-cost sensing, and scalable data collection, while challenges remain in low-texture surfaces, lighting variability, computational load, and robustness under radiation. The paper defines a research agenda toward enabling MARWIN to navigate more autonomously in constrained, safety-critical infrastructures.
https://arxiv.org/abs/2512.00080
Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on this formulation, we propose LTED-Ada for the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection according to the frame rate as well as the recognition accuracy and delay requirements. In the multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.
https://arxiv.org/abs/2511.20716
Monocular simultaneous localization and mapping (SLAM) algorithms estimate drone poses and build a 3D map using a single camera. Current algorithms include sparse methods that lack detailed geometry, while learning-driven approaches produce dense maps but are computationally intensive. Monocular SLAM also faces scale ambiguities, which affect its accuracy. To address these challenges, we propose an edge-aware lightweight monocular SLAM system combining sparse keypoint-based pose estimation with dense edge reconstruction. Our method employs deep learning-based depth prediction and edge detection, followed by optimization to refine keypoints and edges for geometric consistency, without relying on global loop closure or heavy neural computations. We fuse inertial data with vision by using an extended Kalman filter to resolve scale ambiguity and improve accuracy. The system operates in real time on low-power platforms, as demonstrated on a DJI Tello drone with a monocular camera and inertial sensors. In addition, we demonstrate robust autonomous navigation and obstacle avoidance in indoor corridors and on the TUM RGBD dataset. Our approach offers an effective, practical solution to real-time mapping and navigation in resource-constrained environments.
https://arxiv.org/abs/2511.14335
Chess has experienced a large increase in viewership since the pandemic, driven largely by the accessibility of online learning platforms. However, no equivalent assistance exists for physical chess games, creating a divide between analog and digital chess experiences. This paper presents CVChess, a deep learning framework for converting chessboard images to Forsyth-Edwards Notation (FEN), which is later input into online chess engines to provide the best next move. Our approach employs a convolutional neural network (CNN) with residual layers to perform piece recognition from smartphone camera images. The system processes RGB images of a physical chess board through a multistep process: image preprocessing using the Hough Line Transform for edge detection, projective transform to achieve a top-down board alignment, segmentation into 64 individual squares, and piece classification into 13 classes (6 unique white pieces, 6 unique black pieces and an empty square) using the residual CNN. Residual connections help retain low-level visual features while enabling deeper feature extraction, improving accuracy and stability during training. We train and evaluate our model using the Chess Recognition Dataset (ChessReD), containing 10,800 annotated smartphone images captured under diverse lighting conditions and angles. The resulting classifications are encoded as an FEN string, which can be fed into a chess engine to generate the optimal move.
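The board-segmentation step above is straightforward once the board has been warped to a top-down square image: cut it into 64 equal cells, one per square, before piece classification. A minimal sketch (the 416 x 416 board size is an assumption for illustration):

```python
import numpy as np

def split_board(board):
    """board: (S, S[, ...]) top-down image with S divisible by 8 ->
    list of 64 square crops in row-major order (top-left square first)."""
    s = board.shape[0] // 8
    return [board[r * s:(r + 1) * s, c * s:(c + 1) * s]
            for r in range(8) for c in range(8)]

# Toy "image" with distinct pixel values so crops are easy to check.
board = np.arange(416 * 416).reshape(416, 416)
squares = split_board(board)   # 64 crops of 52 x 52 pixels each
```

Each crop is then fed independently to the 13-class piece classifier, and the 64 predictions are serialized into the FEN string.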
https://arxiv.org/abs/2511.11522
Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range (25 Hz to 500 Hz). Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event camera's inherent bias settings. Utilizing a simple Convolutional Neural Network (CNN), the system identifies instances of flicker in a spatial space and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
https://arxiv.org/abs/2511.02180
Reconstruction of an object from a point cloud is essential in prosthetics, medical imaging, computer vision, etc. We present an effective algorithm for an Allen--Cahn-type model of reconstruction, employing the Lagrange multiplier approach. Utilizing scattered data points from an object, we reconstruct a narrow shell by solving the governing equation enhanced with an edge detection function derived from the unsigned distance function. The specifically designed edge detection function ensures the energy stability. By reformulating the governing equation through the Lagrange multiplier technique and implementing a Crank--Nicolson time discretization, we can update the solutions in a stable and decoupled manner. The spatial operations are approximated using the finite difference method, and we analytically demonstrate the unconditional stability of the fully discrete scheme. Comprehensive numerical experiments, including reconstructions of complex 3D volumes such as characters from \textit{Star Wars}, validate the algorithm's accuracy, stability, and effectiveness. Additionally, we analyze how specific parameter selections influence the level of detail and refinement in the reconstructed volumes. To help interested readers understand our algorithm, we share the computational codes and data at this https URL.
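To convey the type of equation being solved, here is a deliberately simplified 1-D Allen-Cahn sketch: phi_t = eps^2 * phi_xx - F'(phi) with the double-well potential F(phi) = (phi^2 - 1)^2 / 4. The paper's scheme is Crank-Nicolson with a Lagrange multiplier and an edge-detection weight; a plain explicit Euler step with assumed parameter values is used here purely for illustration.

```python
import numpy as np

def allen_cahn_1d(phi, eps=0.05, dt=1e-4, dx=1e-2, steps=2000):
    """Explicit Euler time stepping of the 1-D Allen-Cahn equation
    with periodic boundary conditions."""
    phi = phi.copy()
    for _ in range(steps):
        # Second-order finite-difference Laplacian (periodic via roll).
        lap = (np.roll(phi, 1) - 2 * phi + np.roll(phi, -1)) / dx**2
        phi += dt * (eps**2 * lap - (phi**3 - phi))   # reaction term -F'(phi)
    return phi

x = np.linspace(0, 1, 100, endpoint=False)
phi0 = 0.1 * np.sin(2 * np.pi * x)     # small perturbation of the unstable state
phi = allen_cahn_1d(phi0)              # perturbation grows toward the wells +/-1
```

The explicit step is stable here only because dt * eps^2 / dx^2 is small; the Crank-Nicolson discretization in the paper removes that restriction, which is what the unconditional-stability proof is about.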
https://arxiv.org/abs/2511.00508
Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability: reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
https://arxiv.org/abs/2510.26641
The edge detection task is essential in image processing, aiming to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as the difficulty in detecting loose edges and the lack of context to extract relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector described by a two-dimensional cellular automaton and optimized by a meta-heuristic combined with transfer learning techniques was developed. This study aims to analyze the impact of expanding the search space of the optimization phase and the robustness of the adaptability of the detector in identifying edges of a set of natural images and specialized subsets extracted from the same image set. The results obtained show that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation, the model was able to adapt to the input and the transfer learning techniques applied to the model showed no significant improvements.
https://arxiv.org/abs/2510.26509
Intelligent vehicles are one of the most important outcomes gained from the world tendency toward automation. Applications of IVs, whether on urban roads or robot tracks, prioritize lane-path detection. This paper proposes an FPGA-based Lane Detector Vehicle (LDV) architecture that relies on the Sobel algorithm for edge detection. Operating on 416 x 416 images at 150 MHz, the system can generate a valid output every 1.17 ms. The valid output consists of the number of present lanes, the current lane index, as well as its right and left boundaries. Additionally, the automated light and temperature control units in the proposed system enhance its adaptability to the surrounding environmental conditions.
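A software reference model of the Sobel stage, of the kind one might use to verify an FPGA implementation against, can be sketched as follows. The 416 x 416 frame size matches the paper; the |Gx| + |Gy| magnitude approximation (a common hardware-friendly choice) and the threshold value are assumptions.

```python
import numpy as np

def sobel_edges(img, thresh=128):
    """img: (H, W) uint8 -> binary edge map via the |Gx| + |Gy| norm.
    Output is (H-2, W-2): border pixels without a full 3x3 window are dropped."""
    p = img.astype(np.int32)
    # Vectorized 3x3 Sobel via shifted slices (no convolution library needed).
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return (np.abs(gx) + np.abs(gy) >= thresh).astype(np.uint8)

frame = np.zeros((416, 416), np.uint8)
frame[:, 200:] = 255                  # vertical step edge at x = 200
edges = sobel_edges(frame)            # responds on the columns straddling the step
```

In the hardware pipeline the same window arithmetic would be implemented with line buffers, one output pixel per clock.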
https://arxiv.org/abs/2510.24778
This study introduces a modular framework for spatial image processing, integrating grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction. A stepwise intensity transformation quantizes grayscale images into eight discrete levels, producing a posterization effect that simplifies representation while preserving structural detail. Color enhancement is achieved via histogram equalization in both RGB and YCrCb color spaces, with the latter improving contrast while maintaining chrominance fidelity. Brightness adjustment is implemented through HSV value-channel manipulation, and image sharpening is performed using a 3 x 3 convolution kernel to enhance high-frequency details. A bidirectional transformation pipeline that integrates unsharp masking, gamma correction, and noise amplification achieved accuracy levels of 76.10% and 74.80% for the forward and reverse processes, respectively. Geometric feature extraction employed Canny edge detection, Hough-based line estimation (e.g., 51.50° for billiard cue alignment), Harris corner detection, and morphological window localization. Cue isolation further yielded 81.87% similarity against ground truth images. Experimental evaluation across diverse datasets demonstrates robust and deterministic performance, highlighting its potential for real-time image analysis and computer vision.
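The stepwise intensity transformation described above is easy to make concrete: quantize an 8-bit grayscale image into eight discrete levels to obtain the posterization effect. Mapping each 32-value bin to its bin centre is an assumption; the framework may use a different representative value per level.

```python
import numpy as np

def quantize_8_levels(img):
    """img: (H, W) uint8 -> uint8 image containing at most 8 distinct levels."""
    level = img // 32                            # 256 / 8 = 32 intensities per level
    return (level * 32 + 16).astype(np.uint8)    # map each bin to its centre

# Horizontal intensity ramp covering all 256 grey values.
grad = np.tile(np.arange(256, dtype=np.uint8), (4, 1))
post = quantize_8_levels(grad)
n_levels = len(np.unique(post))                  # posterized to 8 bands
```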
https://arxiv.org/abs/2510.08449