This paper introduces a lightweight uncertainty estimator capable of predicting multimodal (disjoint) uncertainty bounds by integrating conformal prediction with a deep-learning regressor. We specifically discuss its application to visual odometry (VO), where environmental features such as flying domain symmetries and sensor measurements under ambiguities and occlusion can result in multimodal uncertainties. Our simulation results show that uncertainty estimates in our framework adapt sample-wise to challenging operating conditions such as pronounced noise, limited training data, and limited parametric size of the prediction model. We also develop a reasoning framework that leverages these robust uncertainty estimates and incorporates optical flow-based reasoning to improve prediction accuracy. Thus, by appropriately accounting for the predictive uncertainties of data-driven learning and closing their estimation loop via rule-based reasoning, our methodology consistently surpasses conventional deep learning approaches in all of these challenging scenarios (pronounced noise, limited training data, and limited model size), reducing the prediction error by 2-3x.
https://arxiv.org/abs/2309.11018
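A minimal sketch of the split-conformal step that underlies estimators of this kind, assuming a held-out calibration set; it produces plain (unimodal) intervals and omits the density- or cluster-based nonconformity scores needed for the paper's disjoint, multimodal bounds.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_targets, test_preds, alpha=0.1):
    """Wrap any point regressor with distribution-free prediction intervals
    at miscoverage level alpha, using a held-out calibration split."""
    residuals = np.abs(cal_targets - cal_preds)          # nonconformity scores
    n = len(residuals)
    # finite-sample corrected quantile of the calibration residuals
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(residuals, level, method="higher")
    return test_preds - q, test_preds + q
```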
In static environments, visual simultaneous localization and mapping (V-SLAM) methods achieve remarkable performance. However, moving objects severely affect core modules of such systems, such as state estimation and loop closure detection. To address this, dynamic SLAM approaches often use semantic information, geometric constraints, or optical flow to mask features associated with dynamic entities. These approaches are limited by various factors, such as a dependency on the quality of the underlying method and poor generalization to unknown or unexpected moving objects, and they often produce noisy results, e.g., by masking static but movable objects or relying on predefined thresholds. In this paper, to address these trade-offs, we introduce a novel visual SLAM system, DynaPix, based on per-pixel motion probability values. Our approach consists of a new semantic-free probabilistic pixel-wise motion estimation module and an improved pose optimization process. Our per-pixel motion probability estimation combines a novel static background differencing method on both images and optical flows from splatted frames. DynaPix fully integrates those motion probabilities into both map point selection and weighted bundle adjustment within the tracking and optimization modules of ORB-SLAM2. We evaluate DynaPix against ORB-SLAM2 and DynaSLAM on both the GRADE and TUM-RGBD datasets, obtaining lower errors and longer trajectory tracking times. We will release both source code and data upon acceptance of this work.
https://arxiv.org/abs/2309.09879
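A hypothetical sketch of how static background differencing on intensities and on optical flow could be fused into a per-pixel motion probability without hard thresholds; the soft squashing, scale parameters, and fusion rule here are illustrative assumptions, not DynaPix's actual formulation.

```python
import numpy as np

def pixel_motion_probability(img, bg_img, flow, bg_flow, sigma_i=10.0, sigma_f=1.0):
    """Fuse intensity- and flow-based background differencing into a per-pixel
    motion probability in [0, 1]. img, bg_img: (H, W, 3); flow, bg_flow: (H, W, 2)."""
    d_img = np.linalg.norm(img.astype(np.float32) - bg_img.astype(np.float32), axis=-1)
    d_flow = np.linalg.norm(flow - bg_flow, axis=-1)
    # squash each cue with a soft exponential instead of a hard threshold
    p_img = 1.0 - np.exp(-(d_img / sigma_i) ** 2)
    p_flow = 1.0 - np.exp(-(d_flow / sigma_f) ** 2)
    return np.maximum(p_img, p_flow)      # keep the stronger motion evidence
```

In a weighted bundle adjustment, such probabilities would typically down-weight the reprojection residuals of likely-dynamic map points rather than discard them outright.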
The ability to detect objects in all lighting (i.e., normal-, over-, and under-exposed) conditions is crucial for real-world applications, such as self-driving. Traditional RGB-based detectors often fail under such varying lighting conditions. Therefore, recent works utilize novel event cameras to supplement or guide the RGB modality; however, these methods typically adopt asymmetric network structures that rely predominantly on the RGB modality, resulting in limited robustness for all-day detection. In this paper, we propose EOLO, a novel object detection framework that achieves robust and efficient all-day detection by fusing both RGB and event modalities. Our EOLO framework is built on a lightweight spiking neural network (SNN) to efficiently leverage the asynchronous property of events. Building on this, we first introduce an Event Temporal Attention (ETA) module to learn the high temporal information from events while preserving crucial edge information. Secondly, as different modalities exhibit varying levels of importance under diverse lighting conditions, we propose a novel Symmetric RGB-Event Fusion (SREF) module to effectively fuse RGB-Event features without relying on a specific modality, thus ensuring a balanced and adaptive fusion for all-day detection. In addition, to compensate for the lack of paired RGB-Event datasets for all-day training and evaluation, we propose an event synthesis approach based on randomized optical flow that allows the event frame to be generated directly from a single exposure image. We further build two new datasets, E-MSCOCO and E-VOC, based on the popular benchmarks MSCOCO and PASCAL VOC. Extensive experiments demonstrate that our EOLO outperforms state-of-the-art detectors, e.g., RENet, by a substantial margin (+3.74% mAP50) in all lighting conditions. Our code and datasets will be available at this https URL
https://arxiv.org/abs/2309.09297
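A rough sketch of the idea of synthesizing an event frame from a single exposure image via a randomized flow, assuming OpenCV: warp the log-intensity image with a random smooth flow and threshold the brightness change into ON/OFF polarities. The flow generator, contrast threshold, and log-intensity model are illustrative guesses, not EOLO's exact procedure.

```python
import numpy as np
import cv2

def synthesize_event_frame(image_bgr, max_disp=4.0, threshold=0.15, seed=0):
    """Generate a pseudo event frame (+1 ON, -1 OFF, 0 none) from one image."""
    rng = np.random.default_rng(seed)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    log_i = np.log(gray + 1e-3)
    h, w = gray.shape
    # random smooth flow: low-resolution noise upsampled to image size
    coarse = rng.uniform(-max_disp, max_disp,
                         size=(h // 32 + 1, w // 32 + 1, 2)).astype(np.float32)
    flow = cv2.resize(coarse, (w, h), interpolation=cv2.INTER_CUBIC)
    xx, yy = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    warped = cv2.remap(log_i, xx + flow[..., 0], yy + flow[..., 1],
                       cv2.INTER_LINEAR, borderMode=cv2.BORDER_REPLICATE)
    diff = warped - log_i
    events = np.zeros_like(diff, dtype=np.int8)
    events[diff > threshold] = 1       # ON events
    events[diff < -threshold] = -1     # OFF events
    return events
```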
Predicting pedestrian movements remains a complex and persistent challenge in robot navigation research. We must evaluate several factors to achieve accurate predictions, such as pedestrian interactions, the environment, crowd density, and social and cultural norms. Accurate prediction of pedestrian paths is vital for ensuring safe human-robot interaction, especially in robot navigation. Furthermore, this research has potential applications in autonomous vehicles, pedestrian tracking, and human-robot collaboration. Therefore, in this paper, we introduce FlowMNO, an Optical Flow-Integrated Markov Neural Operator designed to capture pedestrian behavior across diverse scenarios. Our paper models trajectory prediction as a Markovian process, where future pedestrian coordinates depend solely on the current state. This problem formulation eliminates the need to store previous states. We conducted experiments using standard benchmark datasets like ETH, HOTEL, ZARA1, ZARA2, UCY, and RGB-D pedestrian datasets. Our study demonstrates that FlowMNO outperforms some of the state-of-the-art deep learning methods, such as LSTM-, GAN-, and CNN-based approaches, by approximately 86.46% when predicting pedestrian trajectories. Thus, we show that FlowMNO can seamlessly integrate into robot navigation systems, enhancing their ability to navigate crowded areas smoothly.
https://arxiv.org/abs/2309.09137
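Because the formulation is Markovian, inference reduces to an autoregressive one-step rollout with no history buffer. The sketch below assumes a generic learned one-step operator `step_fn` (a hypothetical interface standing in for the trained neural operator).

```python
import numpy as np

def markov_rollout(step_fn, state0, horizon):
    """Roll out a Markovian trajectory model: x_{t+1} = step_fn(x_t).
    Only the current state is kept; no past states are stored."""
    states = [np.asarray(state0, dtype=np.float32)]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))   # next state depends only on the current one
    return np.stack(states[1:])              # predicted future coordinates
```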
In this work, we contribute towards the development of video-based epileptic seizure classification by introducing a novel framework (SETR-PKD) that achieves privacy-preserving early detection of seizures in videos. Specifically, our framework has two significant components: (1) it is built upon optical flow features extracted from the video of a seizure, which encode the seizure motion semiotics while preserving the privacy of the patient; (2) it utilizes transformer-based progressive knowledge distillation, where knowledge is gradually distilled from networks trained on longer portions of video samples to those that will operate on shorter portions. Thus, our proposed framework addresses the limitations of current approaches, which compromise the privacy of patients by directly operating on the RGB video of a seizure and impede real-time detection by requiring the full video sample to make a prediction. Our SETR-PKD framework can detect tonic-clonic seizures (TCSs) in a privacy-preserving manner with an accuracy of 83.9% while they are only halfway into their progression. Our data and code are available at this https URL
https://arxiv.org/abs/2309.08794
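The progressive distillation can be pictured as a chain of standard knowledge-distillation losses in which a teacher trained on longer optical-flow clips supervises a student that sees shorter clips. The PyTorch sketch below shows only the generic KD loss; the temperature, weighting, and the clip-length schedule are assumed hyperparameters, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KD (teacher trained on longer clips) plus the usual
    hard-label cross-entropy for the student that sees shorter clips."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```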
We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously verified ground truth, on 17 video sequences. Methods developed for wide baseline stereo (e.g., 5-point methods) perform poorly on monocular video. On the other hand, methods used in autonomous driving (e.g., SLAM) leverage specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. Finally, for dynamic scenes, commonly used robustification techniques like RANSAC require large numbers of iterations, and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) to efficiently and robustly find the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50% over the next best, and is more accurate than any method, irrespective of speed. This represents a strong new performance point for crowded scenes, an important setting for computer vision. The code and the dataset are available at this https URL.
https://arxiv.org/abs/2309.08588
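Under a small-rotation model for a calibrated camera, each candidate rotation predicts a flow field, and flow vectors consistent with that prediction can vote for it. The sketch below is a crude exhaustive scorer standing in for the paper's efficient Hough transform on SO(3); the instantaneous rotational flow basis is standard, but the discretization and tolerance are illustrative assumptions.

```python
import numpy as np

def rotational_flow(xs, ys, f, omega):
    """Flow predicted by a pure camera rotation omega (rad/frame) for a
    calibrated camera with focal length f; xs, ys are pixel coordinates
    measured from the principal point."""
    x, y = xs / f, ys / f
    u = f * (x * y * omega[0] - (1 + x**2) * omega[1] + y * omega[2])
    v = f * ((1 + y**2) * omega[0] - x * y * omega[1] - x * omega[2])
    return np.stack([u, v], axis=-1)

def rotation_votes(omega, xs, ys, flow, f, tol=1.0):
    """Count flow vectors compatible (within tol pixels) with candidate omega.
    Maximizing this over a discretized rotation grid is a brute-force stand-in
    for the Hough transform on SO(3)."""
    pred = rotational_flow(xs, ys, f, omega)
    return int(np.sum(np.linalg.norm(pred - flow, axis=-1) < tol))
```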
High-resolution multi-modality information acquired by vision-based tactile sensors can support more dexterous manipulation by robot fingers. Optical flow is low-level information directly obtained by vision-based tactile sensors, which can be transformed into other modalities such as force, geometry, and depth. Current vision-based tactile sensors employ optical flow methods from OpenCV to estimate the deformation of markers in gels. However, these methods are not precise enough to accurately measure the displacement of markers during large elastic deformation of the gel, which can significantly impact the accuracy of downstream tasks. This study proposes a self-supervised optical flow method based on deep learning to achieve high accuracy in displacement measurement for vision-based tactile sensors. The proposed method employs a coarse-to-fine strategy to handle large deformations by constructing a multi-scale feature pyramid from the input image. To better deal with the elastic deformation caused by the gel, the Helmholtz velocity decomposition constraint and the elastic deformation constraint are adopted to address the distortion rate and area change rate, respectively. A local flow fusion module is designed to smooth the optical flow, taking into account the prior knowledge of the blurring effect of gel deformation. We trained the proposed self-supervised network using an open-source dataset and compared it with traditional and deep learning-based optical flow methods. The results show that the proposed method achieves the highest displacement measurement accuracy, demonstrating its potential to enable more precise measurement for downstream tasks using vision-based tactile sensors.
https://arxiv.org/abs/2309.06735
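At its core, self-supervised flow training of this kind minimizes a photometric warping loss plus a smoothness prior. The PyTorch sketch below shows only that generic backbone, with assumed tensor layouts, and leaves out the paper's Helmholtz decomposition and elastic-deformation constraints.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (N,C,H,W) with flow (N,2,H,W) given in pixels."""
    n, _, h, w = img.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xx, yy], dim=0).float().to(img.device)   # (2,H,W), x then y
    coords = grid.unsqueeze(0) + flow
    # normalize coordinates to [-1, 1] for grid_sample
    cx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    cy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack([cx, cy], dim=-1), align_corners=True)

def self_supervised_flow_loss(img1, img2, flow, smooth_w=0.1):
    """Photometric reconstruction + first-order smoothness; the paper's extra
    physical constraints on the gel deformation are omitted here."""
    photometric = (warp(img2, flow) - img1).abs().mean()
    smooth = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean() + \
             (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return photometric + smooth_w * smooth
```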
The accuracy of learning-based optical flow estimation models heavily relies on the realism of the training datasets. Current approaches for generating such datasets either employ synthetic data or generate images with limited realism. However, the domain gap of these data with real-world scenes constrains the generalization of the trained model to real-world applications. To address this issue, we investigate generating realistic optical flow datasets from real-world images. Firstly, to generate highly realistic new images, we construct a layered depth representation, known as multiplane images (MPI), from single-view images. This allows us to generate highly realistic novel view images. To generate optical flow maps that correspond accurately to the new image, we calculate the optical flows of each plane using the camera matrix and plane depths. We then project these layered optical flows into the output optical flow map with volume rendering. Secondly, to ensure the realism of motion, we present an independent object motion module that can separate the camera and dynamic object motion in MPI. This module addresses the deficiency in MPI-based single-view methods, where optical flow is generated only by camera motion and does not account for any object movement. We additionally devise a depth-aware inpainting module to merge new images with dynamic objects and address unnatural motion occlusions. We show the superior performance of our method through extensive experiments on real-world datasets. Moreover, our approach achieves state-of-the-art performance in both unsupervised and supervised training of learning-based models. The code will be made publicly available at: \url{this https URL}.
https://arxiv.org/abs/2309.06714
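The per-plane flow and the volume-rendering composite described above can be sketched as follows, assuming fronto-parallel MPI planes, a shared intrinsic matrix, and a known relative camera motion; the paper's exact plane parameterization and rendering weights may differ.

```python
import numpy as np

def plane_flow(K, R, t, depth, h, w):
    """Flow induced on a fronto-parallel MPI plane at `depth` by camera motion
    (R, t), via the plane homography H = K (R - t n^T / d) K^-1 with n = [0,0,1]."""
    n = np.array([0.0, 0.0, 1.0])
    H = K @ (R - np.outer(t, n) / depth) @ np.linalg.inv(K)
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    pts = np.stack([xx, yy, np.ones_like(xx)], axis=-1) @ H.T
    warped = pts[..., :2] / pts[..., 2:3]
    return warped - np.stack([xx, yy], axis=-1)            # (H, W, 2)

def composite_flows(flows, alphas):
    """Front-to-back over-compositing of per-plane flows with per-plane alphas,
    mirroring the volume-rendering step; flows: list of (H,W,2), alphas: (H,W)."""
    out = np.zeros_like(flows[0])
    trans = np.ones_like(alphas[0])            # accumulated transmittance
    for flow, a in zip(flows, alphas):         # planes ordered front to back
        out += trans[..., None] * a[..., None] * flow
        trans *= (1.0 - a)
    return out
```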
Visual odometry (VO) and SLAM have been using multi-view geometry via local structure from motion for decades. These methods are at a slight disadvantage in challenging scenarios such as low-texture images and dynamic scenes. Meanwhile, the use of deep neural networks to extract high-level features is ubiquitous in computer vision. For VO, such networks can provide depth and pose estimates from these high-level features. The visual odometry task can then be modeled as an image generation task, with the pose estimate as a by-product. This can also be achieved in a self-supervised manner, thereby eliminating the data-intensive (supervised) nature of training deep neural networks. Although some works have tried a similar approach [1], their depth and pose estimates are sometimes vague, resulting in an accumulation of error (drift) along the trajectory. The goal of this work is to tackle these limitations of past approaches and to develop a method that can provide better depth and pose estimates. To address this, a couple of approaches are explored: 1) Modeling: optical flow and recurrent neural networks (RNNs) are used to exploit spatio-temporal correlations, which provide more information for estimating depth. 2) Loss function: a generative adversarial network (GAN) [2] is deployed to improve the depth estimation (and thereby the pose too), as shown in Figure 1. This additional loss term improves the realism of generated images and reduces artifacts.
https://arxiv.org/abs/2309.04147
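The geometric core of such self-supervised training is reprojecting target-view pixels into a source view using the predicted depth and relative pose, then comparing photometrically. The PyTorch sketch below shows that reprojection only, with tensor shapes assumed; the RNN and GAN components discussed in the abstract are omitted.

```python
import torch

def reproject(depth, K, T):
    """Map target-view pixels into a source view using predicted depth
    (N,1,H,W), intrinsics K (3,3), and relative pose T (N,4,4).
    Returns source-view pixel coordinates of shape (N,2,H,W)."""
    n, _, h, w = depth.shape
    yy, xx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xx, yy, torch.ones_like(xx)], dim=0).reshape(3, -1)  # (3, H*W)
    rays = torch.linalg.inv(K) @ pix                                        # unit-depth rays
    cam = rays.unsqueeze(0) * depth.reshape(n, 1, -1)                       # back-project
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)            # homogeneous
    src = K @ (T[:, :3, :] @ cam_h)                                         # into source view
    uv = src[:, :2] / src[:, 2:3].clamp(min=1e-6)
    return uv.reshape(n, 2, h, w)
```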
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
https://arxiv.org/abs/2309.03897
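One way to picture the mask-guided sparse transformer is as plain token gathering: only tokens whose patch intersects the inpainting mask (plus whatever context one keeps) enter attention, and their updates are scattered back afterwards. The sketch below shows only that bookkeeping, with shapes and interfaces assumed.

```python
import torch

def gather_masked_tokens(tokens, patch_mask):
    """tokens: (N, L, C); patch_mask: (N, L) bool, True where the patch matters.
    Returns the kept tokens and their (batch, position) indices."""
    idx = patch_mask.nonzero(as_tuple=False)        # (num_kept, 2)
    kept = tokens[idx[:, 0], idx[:, 1]]             # (num_kept, C)
    return kept, idx

def scatter_tokens(tokens, updated, idx):
    """Write transformer-updated tokens back into the dense token grid."""
    out = tokens.clone()
    out[idx[:, 0], idx[:, 1]] = updated
    return out
```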
Video frame interpolation is an important low-level vision task that can increase the frame rate for a more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, which can result in substantial inefficient computation. On the other hand, the achievable degree of computation compression in frame interpolation is highly dependent on both texture distribution and scene motion, which demands understanding the spatio-temporal information of each input frame pair to better select the compression degree. In this work, we propose a novel two-stage frame interpolation framework termed WaveletVFI to address the above problems. It first estimates intermediate optical flow with a lightweight motion perception network, and then a wavelet synthesis network uses flow-aligned context features to predict multi-scale wavelet coefficients with sparse convolution for efficient target frame reconstruction, where the sparse valid masks that control computation at each scale are determined by a crucial threshold ratio. Instead of setting a fixed value as in previous methods, we find that embedding a classifier in the motion perception network to learn a dynamic threshold for each sample can achieve more computation reduction with almost no loss of accuracy. On common high-resolution and animation frame interpolation benchmarks, the proposed WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods. Code is available at this https URL.
https://arxiv.org/abs/2309.03508
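The sparse valid masks can be read as thresholding wavelet-coefficient magnitudes with a per-sample ratio predicted by the classifier. The PyTorch sketch below shows that masking step only; the coefficient layout and the way the ratio is applied are assumptions, not the paper's exact scheme.

```python
import torch

def sparse_valid_mask(wavelet_coeffs, threshold_ratio):
    """wavelet_coeffs: (N, C, H, W) predicted coefficients at one scale;
    threshold_ratio: (N,) per-sample ratio in (0, 1), e.g. from a classifier.
    Returns a (N, 1, H, W) mask: 1 = run sparse convolution here, 0 = skip."""
    mag = wavelet_coeffs.abs().amax(dim=1, keepdim=True)        # (N,1,H,W)
    peak = mag.amax(dim=(2, 3), keepdim=True)                   # per-sample peak magnitude
    thr = threshold_ratio.view(-1, 1, 1, 1) * peak
    return (mag > thr).float()
```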
Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning, with three distinctive contributions: (1) rethinking optical flow and showing that it is effective for capturing long-horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both the vision and text modalities; and (3) preventing the model from answering the question given a shuffled video in the fine-tuning stage, to avoid spurious correlations between appearance and motion and hence ensure faithful temporality reasoning. In our experiments, we show that ATM outperforms previous approaches in accuracy on multiple VideoQA benchmarks and exhibits better true temporality reasoning ability.
https://arxiv.org/abs/2309.02290
In this paper, we devise a mechanism for adding multi-modal information to an existing pipeline for continuous sign language recognition and translation. In our procedure, we incorporate optical flow information with RGB images to enrich the features with movement-related information. This work studies the feasibility of such modality inclusion using a cross-modal encoder. The plugin we use is very lightweight and does not need a separate feature extractor for the new modality, working in an end-to-end manner. We have applied the changes to both sign language recognition and translation, improving the results in each case. We evaluated performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduced the WER by 0.9, and on the translation task, our approach increased most of the BLEU scores by ~0.6 on the test set.
https://arxiv.org/abs/2309.01860
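A lightweight cross-modal plugin of the kind described can be sketched as a single cross-attention layer in which RGB features attend to optical-flow features. This is a generic stand-in with assumed shapes, not the paper's exact encoder.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """RGB features query optical-flow features; a residual connection keeps
    the original RGB stream intact (hypothetical module names and shapes)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feats, flow_feats):
        # rgb_feats, flow_feats: (N, T, dim) frame-level features
        fused, _ = self.attn(rgb_feats, flow_feats, flow_feats)
        return self.norm(rgb_feats + fused)
```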
We introduce a novel training strategy for stereo matching and optical flow estimation that utilizes image-to-image translation between synthetic and real image domains. Our approach enables the training of models that excel in real image scenarios while relying solely on ground-truth information from synthetic images. To facilitate task-agnostic domain adaptation and the training of task-specific components, we introduce a bidirectional feature warping module that handles both left-right and forward-backward directions. Experimental results show competitive performance over previous domain translation-based methods, which substantiate the efficacy of our proposed framework, effectively leveraging the benefits of unsupervised domain adaptation, stereo matching, and optical flow estimation.
https://arxiv.org/abs/2309.01842
Temporal echo image registration is a basis for clinical quantifications such as cardiac motion estimation, myocardial strain assessments, and stroke volume quantifications. Deep learning image registration (DLIR) is consistently accurate, requires less computing effort, and has shown encouraging results in earlier applications. However, we propose that a greater focus on the warped moving image's anatomic plausibility and image quality can support robust DLIR performance. Further, past implementations have focused on adult echo, and there is an absence of DLIR implementations for fetal echo. We propose a framework combining three strategies for DLIR for both fetal and adult echo: (1) an anatomic shape-encoded loss to preserve physiological myocardial and left ventricular anatomical topologies in warped images; (2) a data-driven loss that is trained adversarially to preserve good image texture features in warped images; and (3) a multi-scale training scheme of a data-driven and anatomically constrained algorithm to improve accuracy. Our experiments show that the shape-encoded loss and the data-driven adversarial loss are strongly correlated to good anatomical topology and image textures, respectively. They improve different aspects of registration performance in a non-overlapping way, justifying their combination. We show that these strategies can provide excellent registration results in both adult and fetal echo using the publicly available CAMUS adult echo dataset and our private multi-demographic fetal echo dataset, despite fundamental distinctions between adult and fetal echo images. Our approach also outperforms traditional non-DL gold standard registration approaches, including Optical Flow and Elastix. Registration improvements could also be translated to more accurate and precise clinical quantification of cardiac ejection fraction, demonstrating a potential for translation.
https://arxiv.org/abs/2309.00831
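The three strategies combine naturally into one training objective: an image-similarity term, an anatomic shape-encoded (e.g., Dice) term on warped segmentations, and an adversarial realism term. The PyTorch sketch below is a schematic combination with assumed weights and a hypothetical discriminator interface, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def dlir_loss(warped_img, fixed_img, warped_seg, fixed_seg, disc_score,
              w_shape=1.0, w_adv=0.1):
    """Schematic registration objective: similarity + shape-encoded + adversarial.
    disc_score: discriminator logits on the warped image (hypothetical)."""
    sim = (warped_img - fixed_img).pow(2).mean()                  # image similarity
    inter = (warped_seg * fixed_seg).sum()
    dice = 1.0 - 2.0 * inter / (warped_seg.sum() + fixed_seg.sum() + 1e-6)
    adv = F.binary_cross_entropy_with_logits(
        disc_score, torch.ones_like(disc_score))                  # fool the critic
    return sim + w_shape * dice + w_adv * adv
```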
Supervised and unsupervised techniques have demonstrated the potential for temporal interpolation of video data. Nevertheless, most prevailing temporal interpolation techniques hinge on optical flow, which encodes the motion of pixels between video frames. On the other hand, geospatial data exhibits lower temporal resolution while encompassing a spectrum of movements and deformations that challenge several assumptions inherent to optical flow. In this work, we propose an unsupervised temporal interpolation technique, which does not rely on ground truth data or require any motion information like optical flow, thus offering a promising alternative for better generalization across geospatial domains. Specifically, we introduce a self-supervised technique of dual cycle consistency. Our proposed technique incorporates multiple cycle consistency losses, which result from interpolating two frames between consecutive input frames through a series of stages. This dual cycle consistent constraint causes the model to produce intermediate frames in a self-supervised manner. To the best of our knowledge, this is the first attempt at unsupervised temporal interpolation without the explicit use of optical flow. Our experimental evaluations across diverse geospatial datasets show that STint significantly outperforms existing state-of-the-art methods for unsupervised temporal interpolation.
https://arxiv.org/abs/2309.00059
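The cycle idea can be sketched on a triplet of real frames: midpoints synthesized from consecutive pairs should themselves interpolate back to the real middle frame, requiring no ground-truth intermediates and no optical flow. The paper's dual variant stacks two such cycles across multiple stages, which this minimal PyTorch sketch does not reproduce; `interp_fn` is a hypothetical midpoint interpolator.

```python
import torch

def cycle_consistency_loss(interp_fn, f0, f1, f2):
    """Self-supervised cycle loss on a real frame triplet (f0, f1, f2)."""
    m01 = interp_fn(f0, f1)          # synthesized midpoint of (f0, f1)
    m12 = interp_fn(f1, f2)          # synthesized midpoint of (f1, f2)
    rec = interp_fn(m01, m12)        # should land back on the real middle frame
    return (rec - f1).abs().mean()
```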
This paper develops a new vascular respiratory motion compensation algorithm, Motion-Related Compensation (MRC), which performs vascular respiratory motion compensation by extrapolating the correlation between invisible vascular and visible non-vascular motion. Robot-assisted vascular intervention can significantly reduce the radiation exposure of surgeons. In robot-assisted image-guided intervention, blood vessels are constantly moving/deforming due to respiration, and they are invisible in the X-ray images unless contrast agents are injected. The vascular respiratory motion compensation technique predicts 2D vascular roadmaps in live X-ray images. When blood vessels are visible after contrast agent injection, vascular respiratory motion compensation is conducted based on the sparse Lucas-Kanade feature tracker. An MRC model is trained to learn the correlation between vascular and non-vascular motions. During the intervention, the invisible blood vessels are predicted from the visible tissues and the trained MRC model. Moreover, a Gaussian-based outlier filter is adopted for refinement. Experiments on in-vivo datasets show that the proposed method can yield vascular respiratory motion compensation in 0.032 s, with an average error of 1.086 mm. Our real-time and accurate vascular respiratory motion compensation approach contributes to modern vascular intervention and surgical robots.
https://arxiv.org/abs/2308.16451
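The visible-tissue tracking plus Gaussian-based outlier filtering can be sketched with OpenCV's sparse Lucas-Kanade tracker followed by a z-score gate on the displacements; the gate threshold and the exact filtering statistics are assumptions.

```python
import numpy as np
import cv2

def track_visible_tissue(prev_img, next_img, pts, z_thresh=2.0):
    """Track visible (non-vascular) feature points with sparse Lucas-Kanade,
    then drop displacements far from the Gaussian bulk of the motion.
    prev_img, next_img: grayscale uint8 frames; pts: (N, 2) pixel coordinates."""
    p0 = pts.reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, next_img, p0, None)
    ok = status.ravel() == 1
    p0, p1 = p0[ok].reshape(-1, 2), p1[ok].reshape(-1, 2)
    disp = p1 - p0
    mu, sigma = disp.mean(axis=0), disp.std(axis=0) + 1e-6
    inlier = np.all(np.abs(disp - mu) < z_thresh * sigma, axis=1)
    return p0[inlier], p1[inlier]
```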
Lossy video compression is commonly used when transmitting and storing video data. Unified video codecs (e.g., H.264 or H.265) remain the de facto standard, despite the availability of advanced (neural) compression approaches. Transmitting videos in the face of dynamic network bandwidth conditions requires video codecs to adapt to vastly different compression strengths. Rate control modules augment the codec's compression such that bandwidth constraints are satisfied and video distortion is minimized. While both standard video codecs and their rate control modules are developed to minimize video distortion with respect to human quality assessment, preserving the downstream performance of deep vision models is not considered. In this paper, we present the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream vision performance, while not breaking existing standardization. We demonstrate for two common vision tasks (semantic segmentation and optical flow estimation) and on two different datasets that our deep codec control better preserves downstream performance than 2-pass average bit rate control while meeting dynamic bandwidth constraints and adhering to standardization.
https://arxiv.org/abs/2308.16215
Fast, reliable shape reconstruction is an essential ingredient in many computer vision applications. Neural Radiance Fields demonstrated that photorealistic novel view synthesis is within reach, but their adoption was gated by the performance requirements of fast reconstruction of real scenes and objects. Several recent approaches have built on alternative shape representations, in particular 3D Gaussians. We develop extensions to these renderers, such as integrating differentiable optical flow, exporting watertight meshes, and rendering per-ray normals. Additionally, we show how two of the recent methods are interoperable with each other. These reconstructions are quick, robust, and easily performed on GPU or CPU. For code and visual examples, see this https URL
https://arxiv.org/abs/2308.14737
Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo masks and a simple video synthesis method for model training is surprisingly sufficient to enable the resulting video model to effectively segment and track multiple instances across video frames. We show the first competitive unsupervised learning results on the challenging YouTubeVIS-2019 benchmark, achieving 50.7% APvideo^50, surpassing the previous state of the art by a large margin. VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS-2019 in terms of APvideo.
https://arxiv.org/abs/2308.14710