Deep learning-based edge detectors rely heavily on pixel-wise labels, which are often provided by multiple annotators. Existing methods fuse multiple annotations using a simple voting process, ignoring the inherent ambiguity of edges and the labeling bias of annotators. In this paper, we propose a novel uncertainty-aware edge detector (UAED), which employs uncertainty to investigate the subjectivity and ambiguity of diverse annotations. Specifically, we first convert the deterministic label space into a learnable Gaussian distribution, whose variance measures the degree of ambiguity among different annotations. We then regard the learned variance as the estimated uncertainty of the predicted edge maps; pixels with higher uncertainty are likely to be hard samples for edge detection. We therefore design an adaptive weighting loss to emphasize learning from pixels with high uncertainty, which helps the network gradually concentrate on the important pixels. UAED can be combined with various encoder-decoder backbones, and extensive experiments demonstrate that UAED consistently achieves superior performance across multiple edge detection benchmarks. The source code is available at this https URL.
https://arxiv.org/abs/2303.11828
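The adaptive weighting idea in the UAED abstract above can be illustrated with a small sketch. The abstract does not give the loss formula, so the weight `1 + beta * variance`, the function names, and the use of the empirical variance across annotators (instead of a learned variance) are all assumptions made for illustration, not the paper's method.

```python
import math
from statistics import mean, pvariance

def annotation_stats(annotations):
    """Per-pixel mean and variance across annotators.

    `annotations` is a list of per-annotator label lists (hypothetical layout).
    The empirical variance here is only a stand-in for a learned variance.
    """
    pixels = list(zip(*annotations))
    return [mean(p) for p in pixels], [pvariance(p) for p in pixels]

def uncertainty_weighted_bce(probs, labels, variances, beta=1.0, eps=1e-7):
    """Weighted mean binary cross-entropy with weights 1 + beta * variance.

    Assumed weighting form: pixels with higher variance contribute more.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, var in zip(probs, labels, variances):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        bce = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        w = 1.0 + beta * var
        total += w * bce
        weight_sum += w
    return total / weight_sum
```

With `beta=0` the function reduces to the ordinary mean cross-entropy, which makes the effect of the weighting easy to compare.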
We study the problem of reconstructing 3D feature curves of an object from a set of calibrated multi-view images. To do so, we learn a neural implicit field representing the density distribution of 3D edges, which we refer to as the Neural Edge Field (NEF). Inspired by NeRF, NEF is optimized with a view-based rendering loss in which a 2D edge map is rendered at a given view and compared to the ground-truth edge map extracted from the image of that view. The rendering-based differentiable optimization of NEF fully exploits 2D edge detection, without needing supervision of 3D edges, a 3D geometric operator, or cross-view edge correspondence. Several technical designs are devised to ensure learning a range-limited and view-independent NEF for robust edge extraction. The final parametric 3D curves are extracted from NEF with an iterative optimization method. On our benchmark with synthetic data, we demonstrate that NEF outperforms existing state-of-the-art methods on all metrics. Project page: this https URL.
https://arxiv.org/abs/2303.07653
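As a rough illustration of the NeRF-inspired rendering step described in the NEF abstract above, the sketch below accumulates edge opacity along a single ray using the standard NeRF quadrature. The abstract does not state NEF's exact rendering formula, so the function name and the choice to render accumulated opacity as the 2D edge value are assumptions.

```python
import math

def render_edge_along_ray(densities, deltas):
    """Accumulate edge opacity along one ray (NeRF-style quadrature).

    densities: edge density sampled at points along the ray (assumed input)
    deltas:    distance between consecutive samples
    Returns a value in [0, 1] for the rendered 2D edge map pixel.
    """
    transmittance = 1.0  # fraction of the ray not yet absorbed
    rendered = 0.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        rendered += transmittance * alpha
        transmittance *= 1.0 - alpha
    return rendered
```

Under these assumptions the sum telescopes to `1 - exp(-sum(sigma_i * delta_i))`, so a ray that passes through no edge density renders to zero.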
In recent years, compact and efficient scene understanding representations have gained popularity for increasing the situational awareness and autonomy of robotic systems. In this work, we illustrate the concept of panoptic edge segmentation and propose PENet, a novel detection network that combines semantic edge detection and instance-level perception into a compact panoptic edge representation. This is obtained through a joint network trained by multi-task learning that concurrently predicts semantic edges, instance centers, and an offset flow map, without bounding box predictions, exploiting the cross-task correlations among the tasks. The proposed approach allows extending semantic edge detection to panoptic edge detection, which encapsulates both category-aware and instance-aware segmentation. We validate the proposed panoptic edge segmentation method and demonstrate its effectiveness on the real-world Cityscapes dataset.
https://arxiv.org/abs/2303.08848
Achieving high-quality semantic segmentation predictions using only image-level labels enables a new level of real-world applicability. Although state-of-the-art networks deliver reliable predictions, the amount of handcrafted pixel-wise annotation needed to enable these results is not feasible in many real-world applications. Hence, several works have already targeted this bottleneck, using classifier-based networks like Class Activation Maps (CAMs) as a base. To address CAM's weaknesses of fuzzy borders and incomplete predictions, state-of-the-art approaches rely only on adding regulations to the classifier loss or on pixel-similarity-based refinement after the fact. We propose a framework that introduces an additional module using object perimeters for improved saliency. We define object perimeter information as the line separating the object and the background. Our new PerimeterFit module is applied to pre-refine the CAM predictions before the pixel-similarity-based network is used. In this way, PerimeterFit increases the quality of the CAM prediction while simultaneously improving the false negative rate. We investigated a wide range of state-of-the-art unsupervised semantic segmentation networks and edge detection techniques to create useful perimeter maps, which enable our framework to predict object locations with sharper perimeters. We achieved up to 1.5% improvement over frameworks without our PerimeterFit module. We conduct an exhaustive analysis to illustrate that our framework enhances existing state-of-the-art frameworks for image-level-based semantic segmentation. The framework is open-source and accessible online at this https URL.
https://arxiv.org/abs/2303.07892
We describe the development of a real-time smartphone app that allows the user to digitize paper receipts in a novel way by "waving" their phone over the receipts and letting the app automatically detect and rectify the receipts for subsequent text recognition. We show that traditional computer vision algorithms for edge and corner detection do not robustly detect the non-linear and discontinuous edges and corners of a typical paper receipt in real-world settings. This is particularly the case when the colors of the receipt and background are similar, or where other interfering rectangular objects are present. Inaccurate detection of a receipt's corner positions then results in distorted images when using an affine projective transformation to rectify the perspective. We propose an innovative solution to receipt corner detection by treating each of the four corners as a unique "object" and training a Single Shot Detection MobileNet object detection model. We use a small amount of real data and a large amount of automatically generated synthetic data that is designed to be similar to real-world imaging scenarios. We show that our proposed method robustly detects the four corners of a receipt, giving a receipt detection accuracy of 85.3% on real-world data, compared to only 36.9% with a traditional edge detection-based approach. Our method works even when the color of the receipt is virtually indistinguishable from the background. Moreover, our method is trained to detect only the corners of the central target receipt and implicitly learns to ignore other receipts and other rectangular objects. Including synthetic data allows us to train an even better model. These factors are a major advantage over traditional edge detection-based approaches, allowing us to deliver a much better experience to the user.
https://arxiv.org/abs/2303.05763
Point cloud sampling is a less explored research topic for this data representation. The most common sampling methods today are still classical random sampling and farthest point sampling. With the development of neural networks, various methods have been proposed to sample point clouds in a task-based learning manner. However, these methods are mostly generative rather than selecting points directly with mathematical statistics. Inspired by the Canny edge detection algorithm for images and with the help of the attention mechanism, this paper proposes a non-generative Attention-based Point cloud Edge Sampling method (APES), which can capture the outline of input point clouds. Experimental results show that our sampling method achieves better performance because of the important outline information it learns.
https://arxiv.org/abs/2302.14673
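For reference, the classical farthest point sampling baseline mentioned in the APES abstract above can be sketched in a few lines. This is the standard greedy algorithm, not the APES method itself; the function name and the `start` parameter are illustrative choices.

```python
def farthest_point_sampling(points, k, start=0):
    """Greedy farthest point sampling; returns the indices of k chosen points.

    points: sequence of coordinate tuples
    start:  index of the first selected point (arbitrary choice)
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start]
    # Squared distance from each point to its nearest selected point.
    min_dist = [sqdist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = sqdist(points[i], points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Each step picks the point farthest from everything chosen so far, so the samples spread evenly over the cloud, with no special preference for outline points.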
Deep learning based on unrolled algorithms has served as an effective method for accelerated magnetic resonance imaging (MRI). However, many methods ignore the direct use of edge information to assist MRI reconstruction. In this work, we present the edge-weighted pFISTA-Net, which directly applies the detected edge map to the soft-thresholding part of pFISTA-Net. The soft-thresholding value of different regions is adjusted according to the edge map. Experimental results on a public brain dataset show that the proposed method yields reconstructions with lower error and better artifact suppression than state-of-the-art deep learning-based methods. The edge-weighted pFISTA-Net also shows robustness to different undersampling masks and edge detection operators. In addition, we extend the edge-weighted structure to a joint reconstruction and segmentation network and obtain improved reconstruction performance and more accurate segmentation results.
https://arxiv.org/abs/2302.07468
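The edge-dependent threshold described in the pFISTA-Net abstract above can be sketched with the standard soft-thresholding operator. The abstract does not say how the edge map adjusts the threshold, so the rule used here (a lower threshold on edge pixels, controlled by a hypothetical `edge_scale` parameter) is only an assumption.

```python
def soft_threshold(x, t):
    """Standard soft-thresholding: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def edge_weighted_soft_threshold(coeffs, edge_map, base_threshold, edge_scale=0.5):
    """Apply a per-element threshold that depends on an edge map in [0, 1].

    Assumed rule: threshold = base_threshold on flat regions (edge = 0) and
    base_threshold * edge_scale on edges (edge = 1), interpolated in between.
    """
    out = []
    for c, e in zip(coeffs, edge_map):
        t = base_threshold * (1.0 - (1.0 - edge_scale) * e)
        out.append(soft_threshold(c, t))
    return out
```

Under this assumed rule, a small coefficient that would be zeroed in a flat region survives when it lies on a detected edge.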
Interest has increased in missions that drive significantly longer distances per day than has been performed to date. Further, some of these proposed missions require autonomous driving and absolute localization in darkness. For example, the Endurance A mission proposes to drive 1200 km of its total traverse at night. The lack of natural light during such missions limits what can be used as visual landmarks and the range at which landmarks can be observed. For planetary rovers to traverse long ranges, onboard absolute localization is critical to the rover's ability to maintain its planned trajectory and avoid known hazardous regions. Currently, absolute localization is accomplished with a ground-in-the-loop (GITL) operation, in which a human operator matches local maps or images from onboard with orbital images and maps. This GITL operation limits the distance that can be driven in a day to a few hundred meters, which is the distance over which the rover can maintain acceptable localization error via relative methods. Previous work has shown that using craters as landmarks is a promising approach for performing absolute localization on the Moon during the day. In this work, we present a method of absolute localization that uses craters as landmarks and matches detected crater edges on the surface with known craters in orbital maps. We focus on a localization method based on a perception system that has an external illuminator and a stereo camera. We evaluate (1) both monocular and stereo-based surface crater edge detection techniques, (2) methods of scoring the crater edge matches for optimal localization, and (3) localization performance on simulated lunar surface imagery at night. We demonstrate that this technique shows promise for maintaining the absolute localization error of less than 10 m required for most planetary rover missions.
https://arxiv.org/abs/2301.04630
Graph Neural Networks (GNNs) have been widely applied to different tasks such as bioinformatics, drug design, and social networks. However, recent studies have shown that GNNs are vulnerable to adversarial attacks that aim to mislead node or subgraph classification predictions by adding subtle perturbations. Detecting these attacks is challenging due to the small magnitude of the perturbations and the discrete nature of graph data. In this paper, we propose EDoG, a general adversarial edge detection pipeline based on graph generation that does not require knowledge of the attack strategies. Specifically, we propose a novel graph generation approach combined with link prediction to detect suspicious adversarial edges. To effectively train the graph generative model, we sample several sub-graphs from the given graph data. We show, based on the union bound, that since the number of adversarial edges is usually low in practice, the sampled sub-graphs contain adversarial edges only with low probability. In addition, to handle strong attacks that perturb a large number of edges, we propose a set of novel features to perform outlier detection as preprocessing for our detection. Extensive experimental results on three real-world graph datasets, including a private transaction rule dataset from a major company, and on two types of synthetic graphs with controlled properties show that EDoG can achieve above 0.8 AUC against four state-of-the-art unseen attack strategies without requiring any knowledge about the attack type, and around 0.85 with knowledge of the attack type. EDoG significantly outperforms traditional malicious edge detection baselines. We also show that it is difficult for an adaptive attack with full knowledge of our detection pipeline to bypass it.
https://arxiv.org/abs/2212.13607
A complete computer vision system can be divided into two main categories: detection and classification. The lane detection algorithm belongs to the detection category and has been applied in autonomous driving and smart vehicle systems. The lane detection system is responsible for detecting lane markings in a complex road environment. At the same time, lane detection plays a crucial role in the warning system for a car when it departs the lane. The implemented lane detection algorithm is mainly divided into two steps: edge detection and line detection. In this paper, we compare the state-of-the-art implementation performance obtained with both FPGA and GPU to evaluate the trade-offs among latency, power consumption, and utilization. Our comparison emphasizes the advantages and disadvantages of the two systems.
https://arxiv.org/abs/2212.09460
Although continually extending an existing NMT model to new domains or languages has attracted intensive interest in recent years, the equally valuable problem of continually improving a given NMT model in its domain by leveraging knowledge from an unlimited number of existing NMT models has not yet been explored. To facilitate the study, we propose a formal definition of the problem, named knowledge accumulation for NMT (KA-NMT), with corresponding datasets and evaluation metrics, and develop a novel method for KA-NMT. We investigate a novel knowledge detection algorithm to identify beneficial knowledge from existing models at the token level, and propose to learn from beneficial knowledge and learn against other knowledge simultaneously to improve learning efficiency. To alleviate catastrophic forgetting, we further propose to transfer knowledge from the previous to the current version of the given model. Extensive experiments show that our proposed method significantly and consistently outperforms representative baselines under homogeneous, heterogeneous, and malicious model settings for different language pairs.
https://arxiv.org/abs/2212.09097
Over the past decade, there has been a significant increase in the use of Unmanned Aerial Vehicles (UAVs) to support a wide variety of missions, such as remote surveillance, vehicle tracking, and object detection. For problems involving the processing of areas larger than a single image, the mosaicking of UAV imagery is a necessary step. Real-time image mosaicking is used for missions that require a fast response, such as search and rescue. It typically requires information from additional sensors, such as a Global Positioning System (GPS) receiver and an Inertial Measurement Unit (IMU), to facilitate direct orientation, or 3D reconstruction approaches to recover the camera poses. This paper proposes a UAV-based system for real-time creation of incremental mosaics that requires neither direct nor indirect camera parameters such as orientation information. As in previous approaches, feature extraction from images, matching of similar key points between images, finding the homography matrix to warp and align images, and blending images to obtain better-looking mosaics play important roles in achieving a high-quality result. As a novel approach, edge detection is used in the blending step. Experimental results show that the real-time incremental image mosaicking process can be completed satisfactorily and without the need for any additional camera parameters.
https://arxiv.org/abs/2212.02302
Dunhuang murals combine Chinese style and national style, forming a self-contained Chinese-style Buddhist art. They have very high historical and cultural value and research significance. The lines of Dunhuang murals are highly general and expressive, reflecting each character's distinctive personality and complex inner emotions. Therefore, the outline drawing of murals is of great significance to research on Dunhuang culture. Contour generation for Dunhuang murals is an image edge detection task, an important branch of computer vision that aims to extract salient contour information in images. Convolution-based deep learning networks have achieved good results in image edge extraction by exploring the contextual and semantic features of images. However, as the receptive field enlarges, some local detail information is lost, which makes it impossible for them to generate reasonable outline drawings of murals. In this paper, we propose a novel edge detector based on self-attention combined with convolution to generate line drawings of Dunhuang murals. First, a new residual self-attention and convolution mixed module (Ramix) is proposed to fuse local and global features in feature maps. Second, a novel densely connected backbone extraction network is designed to efficiently propagate rich edge feature information from shallow layers into deep layers. Experiments on different public datasets show that, compared with existing methods, our method generates sharper and richer edge maps. In addition, testing on the Dunhuang mural dataset shows that our method achieves very competitive performance.
https://arxiv.org/abs/2212.00935
Multi-illuminant color constancy is a challenging problem with only a few existing methods. For example, one prior work used a small set of predefined white balance settings and spatially blended among them, limiting the solution to predefined illuminations. Another method proposed a generative adversarial network and an angular loss, yet its performance is suboptimal due to the lack of regularization for multi-illumination colors. This paper introduces a transformer-based multi-task learning method to estimate single and multiple light colors from a single input image. To give our deep learning model better cues about the light colors, achromatic-pixel detection and edge detection are used as auxiliary tasks in our multi-task learning setting. By exploiting content features extracted from the input image as tokens, illuminant color correlations between pixels are learned by leveraging contextual information in our transformer. Our transformer approach is further assisted by a contrastive loss defined between the input, output, and ground truth. We demonstrate that our proposed model achieves a 40.7% improvement over a state-of-the-art multi-illuminant color constancy method on a multi-illuminant dataset (LSMI). Moreover, our model maintains robust performance on the single-illuminant dataset (NUS-8) and provides a 22.3% improvement over the state-of-the-art single color constancy method.
https://arxiv.org/abs/2211.08772
As the availability of imagery data continues to swell, so do the demands on transmission, storage, and processing power. The processing requirements for handling this plethora of data are quickly outpacing the utility of conventional processing techniques. Transitioning to quantum processing and algorithms that offer promising efficiencies over conventional methods can address some of these issues. However, to make this transformation possible, fundamental issues of implementing real-time quantum algorithms must be overcome for crucial processes needed in intelligent analysis applications. For example, consider edge detection tasks, which require time-consuming acquisition processes and are further hindered by the complexity of the devices used, limiting the feasibility of implementation in real-time applications. Convolution is another example of an operation that is essential for signal and image processing applications, where the mathematical operations consist of an intelligent mixture of multiplication and addition that requires considerable computational resources. This paper studies a new paired transform-based quantum representation and computation of 1-D and 2-D signal convolutions and gradients. A new visual data representation is defined to simplify convolution calculations, making it feasible to parallelize convolution and gradient operations for more efficient performance. The new data representation is demonstrated on multiple illustrative examples of quantum edge detection, gradients, and convolution. Furthermore, the efficiency of the proposed approach is shown on real-world images.
https://arxiv.org/abs/2210.17490
Endoscopic content area refers to the informative area enclosed by the dark, non-informative border regions present in most endoscopic footage. The estimation of the content area is a common task in endoscopic image processing and computer vision pipelines. Despite the apparent simplicity of the problem, several factors make reliable real-time estimation surprisingly challenging. The lack of rigorous investigation into the topic, combined with the lack of a common benchmark dataset for this task, has been a long-lasting issue in the field. In this paper, we propose two variants of a lean GPU-based computational pipeline combining edge detection and circle fitting. The two variants differ by relying on handcrafted features and learned features, respectively, to extract content area edge point candidates. We also present a first-of-its-kind dataset of manually annotated and pseudo-labelled content areas across a range of surgical indications. To encourage further developments, the curated dataset and an implementation of both algorithms have been made public (this https URL, this https URL). We compare our proposed algorithm with a state-of-the-art U-Net-based approach and demonstrate significant improvement in terms of both accuracy (Hausdorff distance: 6.3 px versus 118.1 px) and computational time (average runtime per frame: 0.13 ms versus 11.2 ms).
https://arxiv.org/abs/2210.14771
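The circle-fitting stage mentioned in the content-area abstract above can be illustrated with a simple algebraic least-squares fit to candidate edge points. The abstract does not say which fitting method the authors use, so this Kasa-style fit and the names `fit_circle` and `_det3` are assumptions chosen for brevity. It runs on the CPU in plain Python, unlike the GPU pipeline described.

```python
import math

def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_circle(points):
    """Least-squares fit of x^2 + y^2 + D*x + E*y + F = 0 to 2D points.

    Returns (center_x, center_y, radius).
    """
    n = float(len(points))
    sxx = sxy = syy = sx = sy = sxz = syz = sz = 0.0
    for x, y in points:
        z = x * x + y * y
        sxx += x * x; sxy += x * y; syy += y * y
        sx += x; sy += y
        sxz += x * z; syz += y * z; sz += z
    # Normal equations M @ [D, E, F] = b, solved with Cramer's rule.
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    b = [-sxz, -syz, -sz]
    det = _det3(m)
    if abs(det) < 1e-12:  # crude degeneracy check for this sketch
        raise ValueError("points are collinear or too few to fit a circle")
    coeffs = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = b[r]
        coeffs.append(_det3(mc) / det)
    d, e, f = coeffs
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)
```

An algebraic fit like this has a closed-form solution, which suits real-time use, but it is sensitive to outliers. A practical pipeline would likely add some form of outlier rejection.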
Extracting high-level structural information from 3D point clouds is challenging but essential for tasks like urban planning or autonomous driving requiring an advanced understanding of the scene at hand. Existing approaches are still not able to produce high-quality results consistently while being fast enough to be deployed in scenarios requiring interactivity. We propose to utilize a novel set of features describing the local neighborhood on a per-point basis via first and second order statistics as input for a simple and compact classification network to distinguish between non-edge, sharp-edge, and boundary points in the given data. Leveraging this feature embedding enables our algorithm to outperform the state-of-the-art techniques in terms of quality and processing time.
https://arxiv.org/abs/2210.13305
This work aims to integrate two learning paradigms, Multi-Task Learning (MTL) and meta learning, to bring together the best of both worlds: simultaneous learning of multiple tasks, an element of MTL, and prompt adaptation to new tasks with less data, a quality of meta learning. We propose Multi-task Meta Learning (MTML), an approach that enhances MTL compared to single-task learning by employing meta learning. The fundamental idea of this work is to train a multi-task model such that, when an unseen task is introduced, it can learn in fewer steps while offering performance at least as good as conventional single-task learning on the new task or on its inclusion within the MTL. Through various experiments, we demonstrate this paradigm on two datasets, NYU-v2 and the Taskonomy dataset, and four tasks: semantic segmentation, depth estimation, surface normal estimation, and edge detection. MTML achieves state-of-the-art results for most of the tasks, and MTL also performs reasonably well for all tasks compared to single-task learning.
https://arxiv.org/abs/2210.06989
Inner retinal neurons are an essential part of the retina, and they are supplied with blood via retinal vessels. This paper primarily focuses on the segmentation of retinal vessels using a triple preprocessing approach. The DRIVE database was used and preprocessed by Gabor filtering, Gaussian blur, and edge detection by Sobel and pruning. Segmentation was carried out by two proposed U-Net architectures. The two architectures were compared in terms of all the standard performance metrics. Preprocessing generated varied and interesting results, which affected the segmentation results of the U-Net architectures. This real-time deployment can help in the efficient preprocessing of images with better segmentation and detection.
https://arxiv.org/abs/2209.11230
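The Sobel edge detection step named in the retinal-vessel abstract above can be sketched in plain Python as follows. This covers only the gradient magnitude with the standard 3x3 Sobel kernels. The pruning step and the paper's actual preprocessing parameters are not described in the abstract and are not modeled, and leaving border pixels at zero is a simplification chosen here.

```python
import math

def sobel_magnitude(image):
    """Gradient magnitude of a 2D grayscale image (list of rows).

    Border pixels are left at 0.0 because the 3x3 kernel does not fit there.
    """
    kx = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient kernel
    ky = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient kernel
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    v = image[r + dr][c + dc]
                    gx += kx[dr + 1][dc + 1] * v
                    gy += ky[dr + 1][dc + 1] * v
            out[r][c] = math.sqrt(gx * gx + gy * gy)
    return out
```

On an image with a vertical step from 0 to 1, the response is nonzero only in the two columns adjacent to the step.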
Visual surveillance aims to perform robust foreground object detection regardless of time and place. Object detection shows good results using only spatial information, but foreground object detection in visual surveillance requires proper processing of both temporal and spatial information. Deep learning-based foreground object detection algorithms outperform classical background subtraction (BGS) algorithms in environments similar to the training environment. However, their performance is lower than that of classical BGS algorithms in environments different from training. This paper proposes a spatio-temporal fusion network (STFN) that extracts temporal and spatial information using a temporal network and a spatial network. We suggest a method that uses a semi-foreground map for stable training of the proposed STFN. The proposed algorithm shows excellent performance in environments different from training, which we demonstrate through experiments with various public datasets. In addition, STFN can generate a compliant background image in a semi-supervised manner, and it can operate in real time on a desktop with a GPU. The proposed method shows 11.28% and 18.33% higher FM than the latest deep learning method on the LASIESTA and SBI datasets, respectively.
https://arxiv.org/abs/2209.08699