In Robot-Assisted Minimally Invasive Surgery (RAMIS), a camera assistant is normally required to control the position and zooming ratio of the laparoscope, following the surgeon's instructions. However, moving the laparoscope frequently may lead to unstable and suboptimal views, while adjusting the zooming ratio may interrupt the workflow of the surgical operation. To this end, we propose a multi-scale Generative Adversarial Network (GAN)-based video super-resolution method to construct a framework for automatic adjustment of the zooming ratio. It provides automatic real-time zooming for high-quality visualization of the Region Of Interest (ROI) during the surgical operation. In the pipeline of the framework, the Kernel Correlation Filter (KCF) tracker is used to track the tips of the surgical tools, while Semi-Global Block Matching (SGBM)-based depth estimation and Recurrent Neural Network (RNN)-based context awareness are developed to determine the upscaling ratio for zooming. The framework is validated on the JIGSAW dataset and the Hamlyn Centre Laparoscopic/Endoscopic Video Datasets, with results demonstrating its practicability.
https://arxiv.org/abs/2011.04003
Deep networks for stereo matching typically leverage 2D or 3D convolutional encoder-decoder architectures to aggregate cost and regularize the cost volume for accurate disparity estimation. Due to content-insensitive convolutions and down-sampling and up-sampling operations, these cost aggregation mechanisms do not take full advantage of the information available in the images. Disparity maps suffer from over-smoothing near occlusion boundaries and from erroneous predictions in thin structures. In this paper, we show how deep adaptive filtering and differentiable semi-global aggregation can be integrated into existing 2D and 3D convolutional networks for end-to-end stereo matching, leading to improved accuracy. The improvements come from utilizing RGB information from the images as a signal to dynamically guide the matching process, in addition to it being the signal we attempt to match across the images. We show extensive experimental results on the KITTI 2015 and Virtual KITTI 2 datasets comparing four stereo networks (DispNetC, GCNet, PSMNet and GANet) after integrating four adaptive filters (segmentation-aware bilateral filtering, dynamic filtering networks, pixel-adaptive convolution and semi-global aggregation) into their architectures. Our code is available at this https URL.
https://arxiv.org/abs/2010.07350
We propose a novel lightweight network for stereo estimation. Our network consists of a fully-convolutional densely connected neural network (FC-DCNN) that computes matching costs between rectified image pairs. Our FC-DCNN method learns expressive features and performs some simple but effective post-processing steps. The densely connected layer structure connects the output of each layer to the input of each subsequent layer. This network structure, together with the fact that we do not use any fully-connected layers or 3D convolutions, leads to a very lightweight network. The output of the network is used to compute matching costs and build a cost volume. Instead of relying on time- and memory-inefficient cost-aggregation methods such as semi-global matching or conditional random fields to improve the result, we rely on filtering techniques, namely the median filter and the guided filter. A left-right consistency check removes inconsistent values. Afterwards, we apply a watershed foreground-background segmentation to the disparity image with the inconsistencies removed, and use the resulting mask to refine the final prediction. We show that our method works well for challenging indoor and outdoor scenes by evaluating it on the Middlebury, KITTI and ETH3D benchmarks. Our full framework is available at this https URL
https://arxiv.org/abs/2010.06950
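The left-right consistency check mentioned above is a standard post-processing step: a disparity is kept only if the right image's disparity map maps the matched pixel back to (approximately) the same place. A minimal pure-Python sketch over a single scanline (illustrative only, not the paper's implementation):

```python
def lr_consistency(disp_left, disp_right, max_diff=1):
    """Invalidate disparities that disagree between the two views.

    Convention: disp_left[x] maps pixel x in the left image to pixel
    x - d in the right image; a consistent match requires
    disp_right[x - d] to agree with d within max_diff.
    Inconsistent or out-of-bounds pixels are set to -1 (invalid).
    """
    width = len(disp_left)
    out = list(disp_left)
    for x in range(width):
        d = disp_left[x]
        xr = x - d
        if xr < 0 or xr >= width or abs(disp_right[xr] - d) > max_diff:
            out[x] = -1
    return out
```

In the paper's pipeline, the invalidated pixels are what the watershed segmentation and mask-based refinement subsequently fill in.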
In this paper, we focus on estimating the 6D pose of objects in point clouds. Although the topic has been widely studied, pose estimation in point clouds remains a challenging problem due to noise and occlusion. To address the problem, a novel 3DPVNet is presented in this work, which utilizes 3D local patches to vote for object 6D poses. 3DPVNet is comprised of three modules. In particular, a Patch Unification (\textbf{PU}) module is first introduced to normalize the input patch and to create a standard local coordinate frame on it in order to generate a reliable vote. We then devise a Weight-guided Neighboring Feature Fusion (\textbf{WNFF}) module, which fuses neighboring features to yield a semi-global feature for the center patch. The WNFF module mines the neighboring information of a local patch, such that the capability to represent local geometric characteristics is significantly enhanced, making the method robust to a certain level of noise. Moreover, we present a Patch-level Voting (\textbf{PV}) module to regress transformations and generate pose votes. After aggregating all votes from the patches and applying a refinement step, the final pose of the object is obtained. Compared to recent voting-based methods, 3DPVNet is patch-level and operates directly on point clouds. Therefore, 3DPVNet requires less computation than point/pixel-level voting schemes and is robust to partial data. Experiments on several datasets demonstrate that 3DPVNet achieves state-of-the-art performance and is also robust against noise and occlusions.
https://arxiv.org/abs/2009.06887
Depth-map computation is a key task in computer vision and robotics. One of the most popular approaches is via computation of a disparity map from images obtained with a stereo camera. The Semi-Global Matching (SGM) method is a popular choice for good accuracy with reasonable computation time. To use such compute-intensive algorithms in real-time applications such as autonomous aerial vehicles and aids for the blind, acceleration using a GPU or FPGA is necessary. In this paper, we show the design and implementation of a stereo-vision system based on an FPGA implementation of More Global Matching (MGM), a variant of SGM. We use 4 paths but store a single cumulative cost value for each pixel. Our stereo-vision prototype uses a Zedboard containing an ARM-based Zynq SoC, a ZED stereo camera / ELP stereo camera / Intel RealSense D435i, and VGA output for visualization. The power consumption attributed to the custom FPGA-based acceleration of the disparity-map computation required for the depth map is just 0.72 W. The update rate of the disparity map is a realistic 10.5 fps.
https://arxiv.org/abs/2007.03269
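The per-path cost accumulation at the heart of SGM and its MGM variant follows a simple dynamic-programming recurrence: the aggregated cost of disparity d at a pixel is its matching cost plus the cheapest transition from the previous pixel on the path (same disparity, a one-level change penalized by P1, or any larger jump penalized by P2). A minimal pure-Python sketch over a single scanline (the penalties and toy costs are illustrative, not taken from the paper or its FPGA design):

```python
def sgm_path(costs, p1=1, p2=4):
    """Aggregate matching costs along one scanline path (core SGM recurrence).

    costs: list of per-pixel cost vectors, costs[x][d] = matching cost of
    disparity d at pixel x. Returns aggregated costs with the same shape.
    """
    ndisp = len(costs[0])
    agg = [list(costs[0])]            # first pixel: no predecessor
    for x in range(1, len(costs)):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(ndisp):
            # transitions: same disparity, +-1 step (P1), any jump (P2)
            best = prev[d]
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d < ndisp - 1:
                best = min(best, prev[d + 1] + p1)
            best = min(best, prev_min + p2)
            # subtracting prev_min keeps the values bounded along the path
            row.append(costs[x][d] + best - prev_min)
        agg.append(row)
    return agg
```

A full SGM implementation runs this recurrence along several paths (4 in the paper's configuration) and sums the results per pixel before taking the winner-takes-all disparity.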
In this paper, we present a novel linear algorithm to estimate the 6-DoF relative pose from consecutive frames of stereo rolling shutter (RS) cameras. Our method is derived under the assumption that the stereo cameras undergo motion with constant velocity around the center of the baseline, and it requires 9 pairs of correspondences on consecutive left and right frames. The stereo RS images enable the recovery of depth maps with the semi-global matching (SGM) algorithm. With the estimated camera motion and depth map, we can correct the RS images to obtain undistorted images without any scene structure assumption. Experiments on both simulated points and synthetic RS images demonstrate the effectiveness of our algorithm for relative pose estimation.
https://arxiv.org/abs/2006.07807
This paper tackles the problem of data abstraction in the context of 3D point sets. Our method classifies points into different geometric primitives, such as planes and cones, leading to a compact representation of the data. Being based on a semi-global Hough voting scheme, the method does not need initialization and is robust, accurate, and efficient. We use a local, low-dimensional parameterization of primitives to determine type, shape and pose of the object that a point belongs to. This makes our algorithm suitable to run on devices with low computational power, as often required in robotics applications. The evaluation shows that our method outperforms state-of-the-art methods both in terms of accuracy and robustness.
https://arxiv.org/abs/2005.07457
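The semi-global Hough voting idea can be illustrated in its classical 2D form, where every point votes for the parameters (theta, rho) of all lines passing through it and peaks in the accumulator reveal the dominant primitives. This toy sketch is a 2D analogue for intuition only, not the paper's 3D plane/cone detector:

```python
import math

def hough_lines(points, n_theta=180, rho_res=1.0):
    """Accumulate votes in (theta, rho) space; each point (x, y) votes for
    every line x*cos(theta) + y*sin(theta) = rho that passes through it.
    Returns the accumulator bin (theta index, rho bin) with the most votes."""
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (t, round(rho / rho_res))
            acc[key] = acc.get(key, 0) + 1
    return max(acc, key=acc.get)
```

The paper's method works analogously but votes with local, low-dimensional primitive parameterizations over 3D point neighborhoods, which is what makes it initialization-free.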
This work presents dense stereo reconstruction using high-resolution images for infrastructure inspections. The state-of-the-art stereo reconstruction methods, both learning and non-learning ones, consume too much computational resource on high-resolution data. Recent learning-based methods achieve top ranks on most benchmarks; however, they suffer from generalization issues due to a lack of task-specific training data. We propose to use a less resource-demanding non-learning method, guided by a learning-based model, to handle high-resolution images and achieve accurate stereo reconstruction. The deep-learning model produces an initial disparity prediction with an uncertainty for each pixel of the down-sampled stereo image pair. The uncertainty serves as a self-measurement of its generalization ability and defines the per-pixel searching range around the initially predicted disparity. The downstream process performs a modified version of the Semi-Global Block Matching method with the up-sampled per-pixel searching range. The proposed deep-learning-assisted method is evaluated on the Middlebury dataset and on high-resolution stereo images collected by our customized binocular stereo camera. The combination of learning and non-learning methods achieves better performance on 12 out of 15 cases of the Middlebury dataset. In our infrastructure inspection experiments, the average 3D reconstruction error is less than 0.004 m.
https://arxiv.org/abs/1912.05012
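The per-pixel searching range described above can be sketched as follows: given an initial disparity prediction and its uncertainty, restrict the downstream SGBM search to an interval around the prediction instead of the full disparity range. This is a simplified illustration with assumed parameter names (k, ndisp), not the authors' code:

```python
def search_ranges(disp_pred, sigma, k=3, ndisp=64):
    """Per-pixel disparity search range from an initial prediction and its
    uncertainty: [d - k*sigma, d + k*sigma], clamped to [0, ndisp - 1].

    disp_pred, sigma: per-pixel predicted disparity and uncertainty.
    Returns a list of (lo, hi) inclusive search intervals.
    """
    ranges = []
    for d, s in zip(disp_pred, sigma):
        lo = max(0, int(d - k * s))
        hi = min(ndisp - 1, int(d + k * s))
        ranges.append((lo, hi))
    return ranges
```

Confident pixels (small sigma) get a narrow interval, so the matcher does far less work than searching all ndisp candidates, while uncertain pixels keep a wide safety margin.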
Despite the availability of many Markov Random Field (MRF) optimization algorithms, their widespread usage is currently limited due to imperfect MRF modelling arising from hand-crafted model parameters. In addition to differentiability, the two main aspects that enable learning these model parameters are the forward and backward propagation time of the MRF optimization algorithm and its parallelization capabilities. In this work, we introduce two fast and differentiable message passing algorithms, namely, Iterative Semi-Global Matching Revised (ISGMR) and Parallel Tree-Reweighted Message Passing (TRWP) which are greatly sped up on GPU by exploiting massive parallelism. Specifically, ISGMR is an iterative and revised version of the standard SGM for general second-order MRFs with improved optimization effectiveness, whereas TRWP is a highly parallelizable version of Sequential TRW (TRWS) for faster optimization. Our experiments on standard stereo benchmarks demonstrate that ISGMR achieves much lower energies than SGM and TRWP is two orders of magnitude faster than TRWS without losing effectiveness in optimization. Furthermore, our CUDA implementations are at least 7 and 650 times faster than PyTorch GPU implementations in the forward and backward propagation, respectively, enabling efficient end-to-end learning with message passing.
https://arxiv.org/abs/1910.10892
Online augmentation of an oblique aerial image sequence with structural information is an essential aspect of 3D scene interpretation and analysis. One key aspect is efficient dense image matching and depth estimation. Here, the Semi-Global Matching (SGM) approach has proven to be one of the most widely used algorithms for efficient depth estimation, providing a good trade-off between accuracy and computational complexity. However, SGM only models a first-order smoothness assumption, thus favoring fronto-parallel surfaces. In this work, we present a hierarchical algorithm that allows for efficient depth and normal map estimation together with confidence measures for each estimate. Our algorithm relies on plane-sweep multi-image matching followed by an extended SGM optimization that incorporates local surface orientations, thus achieving more consistent and accurate estimates in areas made up of slanted surfaces, inherent to oblique aerial imagery. We evaluate numerous configurations of our algorithm on two different datasets using absolute and relative accuracy measures. In our evaluation, we show that the results of our approach are comparable to those achieved by refined Structure-from-Motion (SfM) pipelines, such as COLMAP, which are designed for offline processing. In contrast, however, our approach only considers a confined image bundle of an input sequence, allowing us to perform an online and incremental computation at 1-2 Hz.
https://arxiv.org/abs/1909.09891
Recently, there has been growing interest in developing learning-based methods to detect and utilize salient semi-global or global structures, such as junctions, lines, planes, cuboids, smooth surfaces, and all types of symmetries, for 3D scene modeling and understanding. However, the ground truth annotations are often obtained via human labor, which is particularly challenging and inefficient for such tasks due to the large number of 3D structure instances (e.g., line segments) and other factors such as viewpoints and occlusions. In this paper, we present a new synthetic dataset, Structured3D, with the aim of providing large-scale photo-realistic images with rich 3D structure annotations for a wide spectrum of structured 3D modeling tasks. We take advantage of the availability of millions of professional interior designs and automatically extract 3D structures from them. We generate high-quality images with an industry-leading rendering engine. We use our synthetic dataset in combination with real images to train deep neural networks for room layout estimation and demonstrate improved performance on benchmark datasets.
https://arxiv.org/abs/1908.00222
Running times of light field depth estimation algorithms are typically high, owing to the computational complexity of existing methods and the large amounts of data involved. The aim of our work is to develop a simple and fast algorithm for accurate depth computation. In this context, we propose an approach that applies Semi-Global Matching to the processing of light field images. It is based on comparing pixel correspondences under different metrics in the substantially bounded light field space. We show that our method is suitable for the fast production of a proper result in a variety of light field configurations.
https://arxiv.org/abs/1907.13449
Nowadays, dense stereo matching has become one of the dominant tools in 3D reconstruction of urban regions, owing to its low cost and high flexibility in generating dense 3D points. However, state-of-the-art stereo matching algorithms usually apply a semi-global matching (SGM) strategy. This strategy normally assumes the surface geometry to be piecewise planar and imposes a smoothness penalty to deal with non-textured or repeating-texture areas. On the one hand, this generates smooth surface models; on the other hand, it may lead to smoothing across depth discontinuities, particularly for fence-shaped regions or densely built areas with narrow streets. To solve this problem, we propose to use line segment information extracted from the corresponding orthophoto as a post-processing tool to sharpen the building boundaries of the Digital Surface Model (DSM) generated by SGM. Two methods, based on graph cuts and plane fitting respectively, are proposed and compared. Experimental results on several satellite datasets with ground truth show the robustness and effectiveness of the proposed DSM sharpening method.
https://arxiv.org/abs/1905.09150
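One of the two proposed post-processing methods relies on plane fitting to DSM heights on either side of a building boundary. A generic least-squares plane fit in pure Python (our own minimal sketch via the 3x3 normal equations, not the paper's implementation):

```python
def fit_plane(points):
    """Least-squares plane z = a*x + b*y + c through 3D points, via the
    3x3 normal equations solved by Gaussian elimination with pivoting."""
    # Accumulate A^T A and A^T z for design-matrix rows [x, y, 1]
    m = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0, 0.0, 0.0]
    for x, y, z in points:
        row = (x, y, 1.0)
        for i in range(3):
            rhs[i] += row[i] * z
            for j in range(3):
                m[i][j] += row[i] * row[j]
    # Forward elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 3):
                m[r][c] -= f * m[col][c]
            rhs[r] -= f * rhs[col]
    # Back substitution
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = rhs[r] - sum(m[r][c] * sol[c] for c in range(r + 1, 3))
        sol[r] = s / m[r][r]
    return sol  # (a, b, c)
```

In a boundary-sharpening setting, fitting one plane per side of an extracted line segment and snapping heights to the nearer plane is one plausible way such a refinement can operate.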
Stereo dense image matching can be categorized into low-level feature based matching and deep feature based matching according to the matching cost metric. Census has been proven to be one of the most efficient low-level feature based matching methods, while the fast Convolutional Neural Network (fst-CNN), a deep feature based method, has a short computing time and is robust for satellite images. Thus, a comparison between fst-CNN and census is critical for further studies in stereo dense image matching. This paper uses the cost functions of fst-CNN and census for stereo matching, then applies the semi-global matching method to obtain optimized disparity images. These images are used to produce digital surface models, which are compared with ground truth points. The results show that fst-CNN performs better than census in terms of absolute matching accuracy, histogram of error distribution and matching completeness, but the two algorithms still perform in the same order of magnitude.
https://arxiv.org/abs/1905.09147
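The census cost referred to above compares local intensity orderings rather than raw intensities: each window is turned into a bitstring recording which pixels are darker than the center, and the matching cost is the Hamming distance between the two bitstrings. A minimal pure-Python sketch (illustrative only, 3x3 windows):

```python
def census(window):
    """Census transform of a square intensity window: a bitstring with
    one bit per non-center pixel, set when that pixel is darker than
    the center."""
    n = len(window)
    cy = cx = n // 2
    center = window[cy][cx]
    bits = 0
    for y in range(n):
        for x in range(n):
            if (y, x) == (cy, cx):
                continue
            bits = (bits << 1) | (window[y][x] < center)
    return bits

def census_cost(a, b):
    """Matching cost = Hamming distance between two census signatures."""
    return bin(census(a) ^ census(b)).count("1")
```

Because only relative orderings enter the signature, the cost is invariant to monotonic radiometric changes between the two satellite images, which is a large part of why census remains competitive with learned costs.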
Although nowadays advanced dense image matching (DIM) algorithms are able to produce LiDAR (Light Detection And Ranging)-comparable dense point clouds from satellite stereo images, the accuracy and completeness of such point clouds heavily depend on the geometric parameters of the satellite stereo images. The intersection angle between the two images is normally seen as the most important factor in stereo data acquisition, as state-of-the-art DIM algorithms work best on narrow-baseline (smaller intersection angle) stereos (e.g., Semi-Global Matching regards 15-25 degrees as a good intersection angle). This factor is in line with the traditional aerial photogrammetry configuration, as the intersection angle directly relates to the base-to-height ratio and the texture distortion in the parallax direction, thus affecting both horizontal and vertical accuracy. However, our experiments found that even with very similar (and good) intersection angles, the same DIM algorithm applied to different stereo pairs (of the same area) produced point clouds with dramatically different accuracy as compared to ground truth LiDAR data. This raises a very practical question that is often asked by practitioners: what factors constitute a good satellite stereo pair, such that it produces accurate and optimal results for mapping purposes? In this work, we provide a comprehensive analysis of this matter by performing stereo matching over 1,000 satellite stereo pairs with different acquisition parameters, including their intersection angles, off-nadir angles, sun elevation and azimuth angles, as well as time differences, to offer a thorough answer to this question. This work will potentially provide a valuable reference to researchers working on multi-view satellite image reconstruction, as well as to industrial practitioners minimizing costs for high-quality large-scale mapping.
https://arxiv.org/abs/1905.07476
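The link between intersection angle and vertical accuracy follows from error propagation on the stereo depth equation Z = f*B/d: differentiating gives dZ = Z^2/(f*B) * d_disp, and with the baseline approximated as B = 2*Z*tan(theta/2) for intersection angle theta, a smaller angle means a smaller baseline and hence a larger depth error for the same disparity error. A small illustrative helper (the function and its parameters are ours, not the paper's):

```python
import math

def depth_error(z, focal_px, angle_deg, disp_err_px=0.5):
    """Depth error caused by a disparity error of disp_err_px pixels,
    for rays intersecting at angle_deg (baseline B = 2*Z*tan(angle/2)).
    Follows dZ = Z^2 / (f*B) * d_disp, derived from Z = f*B/disp."""
    baseline = 2.0 * z * math.tan(math.radians(angle_deg) / 2.0)
    return z * z / (focal_px * baseline) * disp_err_px
```

The error is inversely proportional to tan(theta/2), so a 15-degree pair is inherently less accurate vertically than a 25-degree pair, which is exactly the geometric trade-off against matching difficulty discussed in the abstract.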
A fully parallel architecture at the disparity level for efficient semi-global matching (SGM) with a refined rank method is presented. The improved SGM algorithm is implemented with a non-parametric unified rank model that combines Rank filter/AD and Rank SAD. Rank SAD is a novel cost obtained by introducing constraints from the local image structure into the rank method; as a result, the unified rank model with Rank SAD can make up for the defects of Rank filter/AD. Experimental results show both excellent subjective quality and objective performance for the refined SGM algorithm. The fully parallel hardware architecture for SGM is constructed with reasonable strategies at the disparity level. The parallelism of the data stream allows proper throughput for specific applications at an acceptable maximum frequency. The results of RTL emulation and synthesis confirm that the proposed parallel architecture is suitable for VLSI implementation.
https://arxiv.org/abs/1905.03716
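The rank transform underlying the unified rank model replaces each intensity with a local ordering statistic, making the subsequent cost non-parametric and robust to radiometric differences. A minimal pure-Python sketch of the rank transform and a rank/absolute-difference cost (generic textbook forms, not the paper's Rank SAD formulation):

```python
def rank_transform(window):
    """Rank transform: the number of pixels in the window that are darker
    than the center pixel. Depends only on local intensity ordering."""
    n = len(window)
    center = window[n // 2][n // 2]
    return sum(v < center for row in window for v in row)

def rank_ad_cost(win_left, win_right):
    """Rank filter/AD cost: absolute difference of the rank-transformed
    center values of the two windows."""
    return abs(rank_transform(win_left) - rank_transform(win_right))
```

Rank SAD, as described in the abstract, additionally constrains this with local image structure; the sketch above shows only the baseline Rank filter/AD component it builds on.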
In the stereo matching task, matching cost aggregation is crucial in both traditional methods and deep neural network models in order to accurately estimate disparities. We propose two novel neural network layers, aimed at capturing local and whole-image cost dependencies, respectively. The first is a semi-global aggregation layer, a differentiable approximation of semi-global matching; the second is a local guided aggregation layer that follows a traditional cost-filtering strategy to refine thin structures. These two layers can replace the widely used 3D convolutional layer, which is computationally costly and memory-consuming due to its cubic computational/memory complexity. In the experiments, we show that nets with a two-layer guided aggregation block easily outperform the state-of-the-art GC-Net, which has nineteen 3D convolutional layers. We also train a deep guided aggregation network (GA-Net) which achieves better accuracy than state-of-the-art methods on both the Scene Flow dataset and the KITTI benchmarks.
https://arxiv.org/abs/1904.06587
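A differentiable approximation of semi-global matching can be understood as the SGM path recurrence with the hard minimum replaced by a smooth one. The toy pure-Python sketch below uses a log-sum-exp soft minimum purely for illustration; GA-Net's actual layer instead uses learned, content-dependent aggregation weights:

```python
import math

def softmin(values, temperature=1.0):
    """Smooth, differentiable stand-in for min (negative log-sum-exp),
    shifted by the minimum for numerical stability."""
    m = min(values)
    return m - temperature * math.log(
        sum(math.exp(-(v - m) / temperature) for v in values))

def soft_sgm_path(costs, p1=1.0, p2=4.0, t=0.1):
    """SGM path recurrence with the hard min over {stay, +-1 step + P1,
    jump + P2} replaced by a soft minimum, so every term stays
    differentiable with respect to the input costs."""
    ndisp = len(costs[0])
    agg = [list(costs[0])]
    for x in range(1, len(costs)):
        prev = agg[-1]
        pm = softmin(prev, t)
        row = []
        for d in range(ndisp):
            cands = [prev[d], pm + p2]
            if d > 0:
                cands.append(prev[d - 1] + p1)
            if d < ndisp - 1:
                cands.append(prev[d + 1] + p1)
            row.append(costs[x][d] + softmin(cands, t) - pm)
        agg.append(row)
    return agg
```

As the temperature goes to zero, the soft recurrence approaches the classical hard-min SGM result, which is why such layers can be trained end-to-end yet behave like the traditional aggregation at inference time.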
Depth estimation from a single image is a fascinating yet challenging problem with countless applications. Recent works proved that this task can be learned without direct supervision from ground truth labels, leveraging image synthesis on sequences or stereo pairs. Focusing on the second case, in this paper we leverage stereo matching to improve monocular depth estimation. To this aim, we propose monoResMatch, a novel deep architecture designed to infer depth from a single input image by synthesizing features from a different point of view, horizontally aligned with the input image, and performing stereo matching between the two cues. In contrast to previous works sharing this rationale, our network is the first trained end-to-end from scratch. Moreover, we show how obtaining proxy ground truth annotations through traditional stereo algorithms, such as Semi-Global Matching, enables more accurate monocular depth estimation while still avoiding the need for expensive depth labels by keeping the approach self-supervised. Exhaustive experimental results prove how the synergy between i) the proposed monoResMatch architecture and ii) proxy supervision attains state of the art for self-supervised monocular depth estimation. The code is publicly available at this https URL.
https://arxiv.org/abs/1904.04144
Disparity by block-matching stereo is usually used in applications with limited computational power in order to obtain depth estimates. However, research on simple stereo methods has received less attention than the energy-based counterparts, which promise better-quality depth maps with more potential for future improvements. Semi-global matching (SGM) methods offer good performance and easy implementation but suffer from a very high memory footprint because they operate on the full disparity space image. Block-matching stereo, on the other hand, needs much less memory. In this paper, we introduce a novel multi-scale hierarchical block-matching approach using a pyramidal variant of the depth and cost functions, which drastically improves the results of standard block-matching stereo techniques while preserving the low memory footprint and further reducing the complexity of standard block matching. We tested our new multi-block-matching scheme on the Middlebury stereo benchmark, where we obtain results only slightly worse than state-of-the-art SGM implementations.
https://arxiv.org/abs/1901.09593
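The multi-scale idea, estimating disparity on a down-sampled pair and then refining the search at full resolution, can be sketched in pure Python for a single 1D scanline. This is a toy coarse-to-fine analogue of the general principle, not the paper's pyramidal cost functions:

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_block(left, right, x, size, lo, hi):
    """Best disparity for the block starting at x in `left`, searching the
    inclusive range [lo, hi] in `right` (left x matches right x - d)."""
    block = left[x:x + size]
    best, best_d = float("inf"), lo
    for d in range(lo, hi + 1):
        if x - d < 0:
            break
        c = sad(block, right[x - d:x - d + size])
        if c < best:
            best, best_d = c, d
    return best_d

def coarse_to_fine(left, right, x, size, max_d):
    """Estimate disparity at half resolution, then refine at full resolution
    in a +-2 window around the upscaled coarse estimate. The narrow refined
    search is what keeps both time and memory low."""
    left2, right2 = left[::2], right[::2]
    d2 = match_block(left2, right2, x // 2, max(1, size // 2), 0, max_d // 2)
    d0 = 2 * d2
    return match_block(left, right, x, size, max(0, d0 - 2), min(max_d, d0 + 2))
```

The refinement stage only evaluates a handful of candidates per pixel regardless of max_d, which mirrors how the hierarchical scheme avoids SGM's full disparity space image.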
Tactile-based blind grasping addresses realistic robotic grasping in which the hand only has access to proprioceptive and tactile sensors. The robotic hand has no prior knowledge of the object/grasp properties, such as object weight, inertia, and shape. There exists no manipulation controller that rigorously guarantees object manipulation in such a setting. Here, a robust control law is proposed for object manipulation in tactile-based blind grasping. The analysis ensures semi-global asymptotic and exponential stability in the presence of model uncertainties and external disturbances that are neglected in related work. Simulation and experimental results validate the effectiveness of the proposed approach.
https://arxiv.org/abs/1709.02924