The scarcity of ground-truth labels poses one major challenge in developing optical flow estimation models that are both generalizable and robust. While current methods rely on data augmentation, they have yet to fully exploit the rich information available in labeled video sequences. We propose OCAI, a method that supports robust frame interpolation by generating intermediate video frames alongside the corresponding intermediate optical flows. Utilizing a forward-warping approach, OCAI employs occlusion awareness to resolve ambiguities in pixel values and fills in missing values by leveraging the forward-backward consistency of optical flows. Additionally, we introduce a teacher-student-style semi-supervised learning method on top of the interpolated frames. Using a pair of unlabeled frames and the teacher model's predicted optical flow, we generate interpolated frames and flows to train a student model. The teacher's weights are maintained as an exponential moving average of the student's. Our evaluations demonstrate perceptually superior interpolation quality and enhanced optical flow accuracy on established benchmarks such as Sintel and KITTI.
https://arxiv.org/abs/2403.18092
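A minimal PyTorch-style sketch of the teacher-student step described in the OCAI abstract above, assuming generic flow models; `interpolate` stands in for OCAI's occlusion-aware forward warping, and the function names, losses, and decay value are illustrative rather than the paper's exact recipe.

```python
import torch

def ema_update(teacher, student, decay=0.999):
    """Keep the teacher as an exponential moving average of the student (decay assumed)."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def semi_supervised_step(student, teacher, frame0, frame1, interpolate, optimizer):
    """One unlabeled step: the teacher's flow drives frame/flow interpolation, and the
    interpolated pair supervises the student. A sketch, not the paper's exact losses."""
    with torch.no_grad():
        flow_fw = teacher(frame0, frame1)            # teacher flow 0 -> 1
        flow_bw = teacher(frame1, frame0)            # teacher flow 1 -> 0
        # Occlusion-aware forward warping to a middle frame and its flows (placeholder).
        frame_t, flow_t0, flow_t1 = interpolate(frame0, frame1, flow_fw, flow_bw, t=0.5)
    pred_t0 = student(frame_t, frame0)
    pred_t1 = student(frame_t, frame1)
    loss = (pred_t0 - flow_t0).abs().mean() + (pred_t1 - flow_t1).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()
```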
Accurate estimation of the velocities and trajectories of surrounding moving objects is a critical element of perception systems in Automated/Autonomous Vehicles (AVs), with a direct impact on their safety. These are non-trivial problems due to the diverse types and sizes of such objects and their dynamic and random behaviour. Recent point cloud based solutions often use Iterative Closest Point (ICP) techniques, which are known to have certain limitations. For example, their computational costs are high due to their iterative nature, and their estimation error often deteriorates as the relative velocities of the target objects increase (>2 m/s). Motivated by such shortcomings, this paper first proposes a novel Detection and Tracking of Moving Objects (DATMO) technique for AVs based on optical flow, which proves computationally efficient and highly accurate for such problems. This is achieved by representing the driving scenario as a vector field and applying vector calculus theories to ensure spatiotemporal continuity. We also report the results of a comprehensive performance evaluation of the proposed DATMO technique, carried out in this study using synthetic and real-world data. The results of this study demonstrate the superiority of the proposed technique, compared to the DATMO techniques in the literature, in terms of estimation accuracy and processing time across a wide range of relative velocities of moving objects. Finally, we evaluate and discuss the sensitivity of the estimation error of the proposed DATMO technique to various system and environmental parameters, as well as to the relative velocities of the moving objects.
https://arxiv.org/abs/2403.17779
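The abstract does not give the estimator's equations; as a hedged back-of-envelope illustration of recovering metric velocity from optical flow, the sketch below assumes a pinhole camera, per-pixel depth, and a known frame interval (all assumptions, not the paper's vector-field method).

```python
import numpy as np

def object_velocity_from_flow(flow, depth, mask, fx, fy, dt):
    """Approximate metric velocity of one object from dense optical flow.

    flow  : (H, W, 2) pixel displacements between two frames (u, v)
    depth : (H, W) metric depth of the first frame
    mask  : (H, W) boolean mask of the object
    fx,fy : focal lengths in pixels; dt : frame interval in seconds
    Illustrative pinhole-camera approximation only.
    """
    u = np.median(flow[..., 0][mask])   # robust pixel motion in x
    v = np.median(flow[..., 1][mask])   # robust pixel motion in y
    z = np.median(depth[mask])          # object range
    vx = u * z / (fx * dt)              # lateral velocity (m/s)
    vy = v * z / (fy * dt)              # vertical velocity (m/s)
    return vx, vy
```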
The advancement of generation models has led to the emergence of highly realistic artificial intelligence (AI)-generated videos. Malicious users can easily create non-existent videos to spread false information. This letter proposes an effective AI-generated video detection (AIGVDet) scheme by capturing the forensic traces with a two-branch spatio-temporal convolutional neural network (CNN). Specifically, two ResNet sub-detectors are learned separately for identifying anomalies in the spatial and optical flow domains, respectively. The results of these sub-detectors are fused to further enhance the discrimination ability. A large-scale generated video dataset (GVD) is constructed as a benchmark for model training and evaluation. Extensive experimental results verify the high generalization and robustness of our AIGVDet scheme. Code and dataset will be available at this https URL.
https://arxiv.org/abs/2403.16638
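A hedged sketch of the two-branch idea: one CNN scores RGB frames, another scores optical-flow maps, and the outputs are fused. The ResNet-18 backbone, two-channel flow input, and probability averaging are assumptions standing in for the paper's exact sub-detectors and fusion rule.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoBranchDetector(nn.Module):
    """Illustrative two-branch real-vs-generated detector with simple score fusion."""
    def __init__(self):
        super().__init__()
        self.rgb_branch = resnet18(num_classes=2)
        self.flow_branch = resnet18(num_classes=2)
        # Flow maps have 2 channels (u, v); adapt the first conv accordingly (assumption).
        self.flow_branch.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2,
                                           padding=3, bias=False)

    def forward(self, rgb, flow):
        p_rgb = torch.softmax(self.rgb_branch(rgb), dim=1)
        p_flow = torch.softmax(self.flow_branch(flow), dim=1)
        return (p_rgb + p_flow) / 2   # fused probability over {real, generated}
```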
Applications of an efficient emotion recognition system can be found in several domains such as medicine, driver fatigue surveillance, social robotics, and human-computer interaction. Appraising human emotional states, behaviors, and reactions displayed in real-world settings can be accomplished using latent continuous dimensions. Continuous dimensional models of human affect, such as those based on valence and arousal, are more accurate in describing a broad range of spontaneous everyday emotions than more traditional models of discrete stereotypical emotion categories (e.g., happiness, surprise). Most of the prior work on estimating valence and arousal considers laboratory settings and acted data. But for emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in the wild. Action recognition is a domain of Computer Vision that involves capturing complementary information on appearance from still frames and motion between frames. In this paper, we treat emotion recognition from the perspective of action recognition by exploring the application of deep learning architectures specifically designed for action recognition, for continuous affect recognition. We propose a novel three-stream end-to-end deep learning regression pipeline with an attention mechanism, which is an ensemble design based on sub-modules of multiple state-of-the-art action recognition systems. The pipeline constitutes a novel data pre-processing approach with a spatial self-attention mechanism to extract keyframes. The optical flow of high-attention regions of the face is extracted to capture temporal context. The AFEW-VA in-the-wild dataset has been used to conduct comparative experiments. Quantitative analysis shows that the proposed model outperforms multiple standard baselines of both emotion recognition and action recognition models.
https://arxiv.org/abs/2403.16263
Implicit neural representations for video (NeRV) have recently become a novel way for high-quality video representation. However, existing works employ a single network to represent the entire video, which implicitly confuses static and dynamic information. This leads to an inability to effectively compress the redundant static information and a lack of explicit modeling of globally temporally coherent dynamic details. To solve the above problems, we propose DS-NeRV, which decomposes videos into sparse learnable static codes and dynamic codes without the need for explicit optical flow or residual supervision. By setting different sampling rates for the two codes and applying weighted sum and interpolation sampling methods, DS-NeRV efficiently utilizes redundant static information while maintaining high-frequency details. Additionally, we design a cross-channel attention-based (CCA) fusion module to efficiently fuse these two codes for frame decoding. Our approach achieves a high-quality reconstruction of 31.2 PSNR with only 0.35M parameters thanks to its separate static and dynamic code representations and outperforms existing NeRV methods in many downstream tasks. Our project website is at this https URL.
https://arxiv.org/abs/2403.15679
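An illustrative sketch of the two sampling schemes mentioned above (a weighted sum over sparse static codes, interpolation between dynamic codes); the anchor placement and weighting function are assumptions, not DS-NeRV's exact formulation.

```python
import torch

def sample_static(static_codes, t, video_len):
    """Weighted sum of a few learnable static codes (shape: [Ns, C]).
    The weighting (softmax over distance to evenly spaced anchors) is illustrative."""
    ns = static_codes.shape[0]
    anchors = torch.linspace(0, video_len - 1, ns)
    w = torch.softmax(-(anchors - t).abs(), dim=0)
    return (w[:, None] * static_codes).sum(dim=0)

def sample_dynamic(dynamic_codes, t, video_len):
    """Linear interpolation between the two dynamic codes (shape: [Nd, C]) that
    bracket frame index t; Nd < video_len, so the dynamic codes are sparse in time."""
    nd = dynamic_codes.shape[0]
    pos = t / (video_len - 1) * (nd - 1)
    lo, hi = int(pos), min(int(pos) + 1, nd - 1)
    alpha = pos - lo
    return (1 - alpha) * dynamic_codes[lo] + alpha * dynamic_codes[hi]
```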
In minimally invasive endovascular procedures, contrast-enhanced angiography remains the most robust imaging technique. However, it is at the expense of the patient and clinician's health due to prolonged radiation exposure. As an alternative, interventional ultrasound has notable benefits such as being radiation-free, fast to deploy, and having a small footprint in the operating room. Yet, ultrasound is hard to interpret, and highly prone to artifacts and noise. Additionally, interventional radiologists must undergo extensive training before they become qualified to diagnose and treat patients effectively, leading to a shortage of staff, and a lack of open-source datasets. In this work, we seek to address both problems by introducing a self-supervised deep learning architecture to segment catheters in longitudinal ultrasound images, without demanding any labeled data. The network architecture builds upon AiAReSeg, a segmentation transformer built with the Attention in Attention mechanism, and is capable of learning feature changes across time and space. To facilitate training, we used synthetic ultrasound data based on physics-driven catheter insertion simulations, and translated the data into a unique CT-Ultrasound common domain, CACTUSS, to improve the segmentation performance. We generated ground truth segmentation masks by computing the optical flow between adjacent frames using FlowNet2, and performed thresholding to obtain a binary map estimate. Finally, we validated our model on a test dataset, consisting of unseen synthetic data and images collected from silicon aorta phantoms, thus demonstrating its potential for applications to clinical data in the future.
https://arxiv.org/abs/2403.14465
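A minimal sketch of the pseudo-label step described above, assuming the FlowNet2 flow field between adjacent frames has already been computed; the threshold value is a placeholder.

```python
import numpy as np

def pseudo_mask_from_flow(flow, threshold=1.0):
    """Turn a dense flow field (H, W, 2) between adjacent ultrasound frames into a
    binary catheter mask by thresholding the flow magnitude. The threshold is a
    placeholder; the flow itself is assumed to come from FlowNet2 as in the abstract."""
    magnitude = np.linalg.norm(flow, axis=-1)
    return (magnitude > threshold).astype(np.uint8)
```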
Visual simultaneous localization and mapping (VSLAM) has broad applications, with state-of-the-art methods leveraging deep neural networks for better robustness and applicability. However, there is a lack of research on fusing these learning-based methods with multi-sensor information, which could be indispensable for pushing related applications to large-scale and complex scenarios. In this paper, we tightly integrate the trainable deep dense bundle adjustment (DBA) with multi-sensor information through a factor graph. In the framework, recurrent optical flow and DBA are performed among sequential images. The Hessian information derived from DBA is fed into a generic factor graph for multi-sensor fusion, which employs a sliding window and supports probabilistic marginalization. A pipeline for visual-inertial integration is first developed, which provides the minimum capability of metric-scale localization and mapping. Furthermore, other sensors (e.g., global navigation satellite system) are integrated for driftless and geo-referencing functionality. Extensive tests are conducted on both public datasets and self-collected datasets. The results validate the superior localization performance of our approach, which enables real-time dense mapping in large-scale environments. The code has been made open-source (this https URL).
https://arxiv.org/abs/2403.13714
Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. S2DM can generate a group of intrinsically related data sharing the same semantic and stochastic features while varying on temporal features with appropriate guided conditions. We apply S2DM to video generation tasks, and explore the use of optical flow as temporal conditions. Our experimental results show that S2DM outperforms many existing methods in the task of video generation without any temporal-feature modelling modules. For text-to-video generation tasks where temporal conditions are not explicitly given, we propose a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features. We show that, without additional training, our model integrated with another temporal-condition generative model can still achieve comparable performance with existing works. Our results can be viewed at this https URL.
https://arxiv.org/abs/2403.13408
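A heavily abstracted sketch of the sector idea: all frames share one starting noise sample and one semantic condition, while each reverse "ray" receives its own temporal condition (e.g., an optical-flow map). `denoiser` and `schedule` are placeholders for a DDPM-style sampler, not S2DM's actual components.

```python
import torch

@torch.no_grad()
def sector_sample(denoiser, schedule, semantic_cond, temporal_conds, shape):
    """Every frame starts its reverse diffusion from the SAME noise point (shared
    stochastic features) and the same semantic condition, while the temporal
    condition varies per frame. Placeholder sampler for illustration only."""
    x0_noise = torch.randn(shape)                  # one shared noise point
    frames = []
    for tem in temporal_conds:                     # one ray-shaped reverse process per frame
        x = x0_noise.clone()
        for t in reversed(range(len(schedule))):
            x = denoiser(x, t, semantic_cond, tem) # one reverse step (abstracted)
        frames.append(x)
    return torch.stack(frames)
```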
In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, in each video frame, each tracking point is represented as a point query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer. Its visibility is predicted by its updated content feature. Queries belonging to the same tracking point can exchange information through self-attention along the temporal dimension. As all such operations are well-designed in DETR-like algorithms, the model is conceptually very simple. We also adopt some useful designs such as cost volume from optical flow models and develop simple designs to provide long temporal information while mitigating the feature drifting issue. Our framework demonstrates strong performance, achieving state-of-the-art results on various TAP datasets with faster inference speed.
https://arxiv.org/abs/2403.13042
Creating 4D fields of Gaussian Splatting from images or videos is a challenging task due to its under-constrained nature. While the optimization can draw photometric reference from the input videos or be regulated by generative models, directly supervising Gaussian motions remains underexplored. In this paper, we introduce a novel concept, Gaussian flow, which connects the dynamics of 3D Gaussians and pixel velocities between consecutive frames. The Gaussian flow can be efficiently obtained by splatting Gaussian dynamics into the image space. This differentiable process enables direct dynamic supervision from optical flow. Our method significantly benefits 4D dynamic content generation and 4D novel view synthesis with Gaussian Splatting, especially for contents with rich motions that are hard to be handled by existing methods. The common color drifting issue that happens in 4D generation is also resolved with improved Gaussian dynamics. Superior visual quality in extensive experiments demonstrates our method's effectiveness. Quantitative and qualitative evaluations show that our method achieves state-of-the-art results on both tasks of 4D generation and 4D novel view synthesis. Project page: this https URL
https://arxiv.org/abs/2403.12365
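A hedged sketch of the flow supervision this enables: the "Gaussian flow" at each pixel is taken as the compositing-weighted sum of the contributing Gaussians' image-space displacements and penalized against an off-the-shelf optical flow. The tensor layout and L1 penalty are assumptions, not the paper's exact loss.

```python
import torch

def gaussian_flow_loss(per_pixel_weights, gaussian_motion_2d, target_flow, valid=None):
    """Direct dynamic supervision from optical flow (illustrative).

    per_pixel_weights : (H, W, K) alpha-compositing weights of the K Gaussians
                        dominating each pixel (assumed to come from the splatter)
    gaussian_motion_2d: (H, W, K, 2) image-space displacement of those Gaussians
                        between consecutive frames
    target_flow       : (H, W, 2) optical flow from an off-the-shelf estimator
    """
    gaussian_flow = (per_pixel_weights[..., None] * gaussian_motion_2d).sum(dim=2)
    err = (gaussian_flow - target_flow).abs().sum(dim=-1)
    if valid is not None:
        err = err[valid]        # optional occlusion/validity mask
    return err.mean()
```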
Despite the progress of learning-based methods for 6D object pose estimation, the trade-off between accuracy and scalability for novel objects still exists. Specifically, previous methods for novel objects do not make good use of the target object's 3D shape information since they focus on generalization by processing the shape indirectly, making them less effective. We present GenFlow, an approach that enables both accuracy and generalization to novel objects with the guidance of the target object's shape. Our method predicts optical flow between the rendered image and the observed image and refines the 6D pose iteratively. It boosts the performance by a constraint of the 3D shape and the generalizable geometric knowledge learned from an end-to-end differentiable system. We further improve our model by designing a cascade network architecture to exploit the multi-scale correlations and coarse-to-fine refinement. GenFlow ranked first on the unseen object pose estimation benchmarks in both the RGB and RGB-D cases. It also achieves performance competitive with existing state-of-the-art methods for the seen object pose estimation without any fine-tuning.
https://arxiv.org/abs/2403.11510
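A schematic of the render-and-compare loop described above; `render`, `flow_net`, and `solve_pose_update` are placeholders for GenFlow's learned renderer, flow predictor, and shape-constrained pose solver, so this sketches the control flow only.

```python
def refine_pose(pose, observed_img, render, flow_net, solve_pose_update, n_iters=4):
    """Iterative render-and-compare refinement (illustrative): render the object at the
    current pose, predict optical flow from the rendering to the observation, and turn
    that 2D correspondence field into a 6D pose update using the object's 3D shape."""
    for _ in range(n_iters):
        rendered_img, rendered_depth = render(pose)
        flow = flow_net(rendered_img, observed_img)           # 2D correspondences
        pose = solve_pose_update(pose, flow, rendered_depth)  # shape-constrained update
    return pose
```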
3D Gaussian Splatting (3DGS) has become an emerging tool for dynamic scene reconstruction. However, existing methods focus mainly on extending static 3DGS into a time-variant representation, while overlooking the rich motion information carried by 2D observations, thus suffering from performance degradation and model redundancy. To address the above problem, we propose a novel motion-aware enhancement framework for dynamic scene reconstruction, which mines useful motion cues from optical flow to improve different paradigms of dynamic 3DGS. Specifically, we first establish a correspondence between 3D Gaussian movements and pixel-level flow. Then a novel flow augmentation method is introduced with additional insights into uncertainty and loss collaboration. Moreover, for the prevalent deformation-based paradigm that presents a harder optimization problem, a transient-aware deformation auxiliary module is proposed. We conduct extensive experiments on both multi-view and monocular scenes to verify the merits of our work. Compared with the baselines, our method shows significant superiority in both rendering quality and efficiency.
https://arxiv.org/abs/2403.11447
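The abstract mentions uncertainty-aware flow supervision without giving details; below is a hedged sketch that uses forward-backward flow consistency as an uncertainty proxy to down-weight unreliable flow, which may differ from the paper's actual uncertainty modelling.

```python
import torch
import torch.nn.functional as F

def fb_consistency_weight(flow_fw, flow_bw, beta=0.5):
    """Per-pixel confidence in [0, 1] from forward-backward flow consistency.
    flow_fw, flow_bw: (1, 2, H, W). The proxy and beta are assumptions."""
    h, w = flow_fw.shape[-2:]
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()[None]            # (1, H, W, 2)
    warped = grid + flow_fw.permute(0, 2, 3, 1)
    norm = warped.clone()
    norm[..., 0] = 2 * warped[..., 0] / (w - 1) - 1               # normalize to [-1, 1]
    norm[..., 1] = 2 * warped[..., 1] / (h - 1) - 1
    bw_at_fw = F.grid_sample(flow_bw, norm, align_corners=True)   # backward flow at warped positions
    residual = (flow_fw + bw_at_fw).norm(dim=1)                   # ~0 where flow is reliable
    return torch.exp(-beta * residual)

def weighted_flow_loss(pred_flow, target_flow, weight):
    """Flow supervision down-weighted where the target flow is unreliable."""
    return (weight * (pred_flow - target_flow).norm(dim=1)).mean()
```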
We propose a deep learning based novel prediction framework for enhanced bandwidth reduction in motion transfer enabled video applications such as video conferencing, virtual reality gaming and privacy preservation for patient health monitoring. To model complex motion, we use the First Order Motion Model (FOMM) that represents dynamic objects using learned keypoints along with their local affine transformations. Keypoints are extracted by a self-supervised keypoint detector and organized in a time series corresponding to the video frames. Prediction of keypoints, to enable transmission using lower frames per second on the source device, is performed using a Variational Recurrent Neural Network (VRNN). The predicted keypoints are then synthesized to video frames using an optical flow estimator and a generator network. The efficacy of leveraging keypoint-based representations in conjunction with VRNN-based prediction for both video animation and reconstruction is demonstrated on three diverse datasets. For real-time applications, our results show the effectiveness of our proposed architecture by enabling up to 2x additional bandwidth reduction over existing keypoint-based video motion transfer frameworks without significantly compromising video quality.
https://arxiv.org/abs/2403.11337
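A toy illustration of the bandwidth-reduction mechanism: only every k-th keypoint set is transmitted and the receiver predicts the rest. `predictor` stands in for the VRNN; the scheduling rule is an assumption.

```python
def transmit_and_predict(keypoint_stream, predictor, keep_every=2):
    """Send only every `keep_every`-th set of keypoints from the source device and
    predict the skipped ones on the receiver. With keep_every=2 roughly 2x fewer
    keypoint packets are transmitted (illustrative sketch)."""
    received, history = [], []
    for i, kps in enumerate(keypoint_stream):
        if i % keep_every == 0:
            history.append(kps)                    # transmitted keypoints
            received.append(kps)
        else:
            received.append(predictor(history))    # reconstructed on the receiver side
    return received
```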
Real-time high-accuracy optical flow estimation is a crucial component in various applications, including localization and mapping in robotics, object tracking, and activity recognition in computer vision. While recent learning-based optical flow methods have achieved high accuracy, they often come with heavy computation costs. In this paper, we propose a highly efficient optical flow architecture, called NeuFlow, that addresses both high accuracy and computational cost concerns. The architecture follows a global-to-local scheme. Given the features of the input images extracted at different spatial resolutions, global matching is employed to estimate an initial optical flow at 1/16 resolution, capturing large displacements, which is then refined at 1/8 resolution with lightweight CNN layers for better accuracy. We evaluate our approach on Jetson Orin Nano and RTX 2080 to demonstrate efficiency improvements across different computing platforms. We achieve a notable 10x-80x speedup compared to several state-of-the-art methods, while maintaining comparable accuracy. Our approach achieves around 30 FPS on edge computing platforms, which represents a significant breakthrough in deploying complex computer vision tasks such as SLAM on small robots like drones. The full training and evaluation code is available at this https URL.
https://arxiv.org/abs/2403.10425
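A simplified sketch of the coarse global-matching step in the global-to-local scheme: every source feature is correlated with all target features and the initial flow is the soft-argmax of the match distribution (the subsequent 1/8-resolution CNN refinement is omitted, and the temperature is an assumption).

```python
import torch

def global_match_flow(feat0, feat1):
    """Global matching at a coarse (e.g., 1/16) resolution.

    feat0, feat1: (C, H, W) feature maps of the two input images."""
    c, h, w = feat0.shape
    f0 = feat0.reshape(c, -1)                              # (C, HW)
    f1 = feat1.reshape(c, -1)
    corr = torch.softmax(f0.t() @ f1 / c ** 0.5, dim=1)    # (HW, HW) match distribution
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()  # (HW, 2) pixel coords
    matched = corr @ coords                                # expected target coordinate
    flow = (matched - coords).reshape(h, w, 2)             # initial coarse flow
    return flow
```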
Surgical instrument segmentation in laparoscopy is essential for computer-assisted surgical systems. Despite the Deep Learning progress in recent years, the dynamic setting of laparoscopic surgery still presents challenges for precise segmentation. The nnU-Net framework excelled in semantic segmentation analyzing single frames without temporal information. The framework's ease of use, including its ability to be automatically configured, and its low expertise requirements, have made it a popular base framework for comparisons. Optical flow (OF) is a tool commonly used in video tasks to estimate motion and represent it in a single frame, containing temporal information. This work seeks to employ OF maps as an additional input to the nnU-Net architecture to improve its performance in the surgical instrument segmentation task, taking advantage of the fact that instruments are the main moving objects in the surgical field. With this new input, the temporal component would be indirectly added without modifying the architecture. Using CholecSeg8k dataset, three different representations of movement were estimated and used as new inputs, comparing them with a baseline model. Results showed that the use of OF maps improves the detection of classes with high movement, even when these are scarce in the dataset. To further improve performance, future work may focus on implementing other OF-preserving augmentations.
https://arxiv.org/abs/2403.10216
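A minimal sketch of how an OF map can be appended as extra input channels so temporal information enters nnU-Net without architectural changes; the (u, v) representation shown is only one of the three movement representations the abstract evaluates.

```python
import numpy as np

def stack_frame_with_flow(frame_rgb, flow):
    """Build a multi-channel input: the RGB frame plus an optical-flow representation
    as additional channels. frame_rgb: (H, W, 3); flow: (H, W, 2) -> output (H, W, 5)."""
    u, v = flow[..., 0], flow[..., 1]
    return np.concatenate([frame_rgb, u[..., None], v[..., None]], axis=-1)
```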
Video-based surgical instrument segmentation plays an important role in robot-assisted surgeries. Unlike supervised settings, unsupervised segmentation relies heavily on motion cues, which are challenging to discern due to the typically lower quality of optical flow in surgical footage compared to natural scenes. This presents a considerable burden for the advancement of unsupervised segmentation techniques. In our work, we address the challenge of enhancing model performance despite the inherent limitations of low-quality optical flow. Our methodology employs a three-pronged approach: extracting boundaries directly from the optical flow, selectively discarding frames with inferior flow quality, and employing a fine-tuning process with variable frame rates. We thoroughly evaluate our strategy on the EndoVis2017 VOS dataset and Endovis2017 Challenge dataset, where our model demonstrates promising results, achieving a mean Intersection-over-Union (mIoU) of 0.75 and 0.72, respectively. Our findings suggest that our approach can greatly decrease the need for manual annotations in clinical environments and may facilitate the annotation process for new datasets. The code is available at this https URL
https://arxiv.org/abs/2403.10039
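The abstract does not specify how flow quality is measured; the sketch below uses the mean forward-backward flow residual as one plausible per-frame quality proxy for the "discard frames with inferior flow" step, with an assumed threshold.

```python
import numpy as np

def frame_flow_quality(flow_fw, flow_bw):
    """Mean forward-backward flow residual for one frame pair (lower is better).
    flow_fw, flow_bw: (H, W, 2). An illustrative quality proxy, not the paper's metric."""
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.clip((xs + flow_fw[..., 0]).round().astype(int), 0, w - 1)
    ty = np.clip((ys + flow_fw[..., 1]).round().astype(int), 0, h - 1)
    residual = flow_fw + flow_bw[ty, tx]        # ~0 where the flow is self-consistent
    return np.linalg.norm(residual, axis=-1).mean()

def keep_good_frames(frame_ids, fw_flows, bw_flows, max_residual=2.0):
    """Selectively discard frames whose flow quality is below par (threshold assumed)."""
    return [i for i, (f, b) in zip(frame_ids, zip(fw_flows, bw_flows))
            if frame_flow_quality(f, b) <= max_residual]
```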
Ego-to-exo video generation refers to generating the corresponding exocentric video according to the egocentric video, providing valuable applications in AR/VR and embodied AI. Benefiting from advancements in diffusion model techniques, notable progress has been achieved in video generation. However, existing methods build upon the spatiotemporal consistency assumptions between adjacent frames, which cannot be satisfied in ego-to-exo scenarios due to drastic changes in views. To this end, this paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action intention, consisting of human movement and action description, as a view-independent representation to guide video generation, preserving the consistency of content and motion. Specifically, the egocentric head trajectory is first estimated through multi-view stereo matching. Then, a cross-view feature perception module is introduced to establish correspondences between exo- and ego-views, guiding the trajectory transformation module to infer human full-body movement from the head trajectory. Meanwhile, we present an action description unit that maps the action semantics into a feature space consistent with the exocentric image. Finally, the inferred human movement and high-level action descriptions jointly guide the generation of exocentric motion and interaction content (i.e., corresponding optical flow and occlusion maps) in the backward process of the diffusion model, ultimately warping them into the corresponding exocentric video. We conduct extensive experiments on the relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments, demonstrating its efficacy in ego-to-exo video generation.
https://arxiv.org/abs/2403.09194
As the use of neuromorphic, event-based vision sensors expands, the need for compression of their output streams has increased. While their operational principle ensures event streams are spatially sparse, the high temporal resolution of the sensors can result in high data rates from the sensor depending on scene dynamics. For systems operating in communication-bandwidth-constrained and power-constrained environments, it is essential to compress these streams before transmitting them to a remote receiver. Therefore, we introduce a flow-based method for the real-time asynchronous compression of event streams as they are generated. This method leverages real-time optical flow estimates to predict future events without needing to transmit them, therefore, drastically reducing the amount of data transmitted. The flow-based compression introduced is evaluated using a variety of methods including spatiotemporal distance between event streams. The introduced method itself is shown to achieve an average compression ratio of 2.81 on a variety of event-camera datasets with the evaluation configuration used. That compression is achieved with a median temporal error of 0.48 ms and an average spatiotemporal event-stream distance of 3.07. When combined with LZMA compression for non-real-time applications, our method can achieve state-of-the-art average compression ratios ranging from 10.45 to 17.24. Additionally, we demonstrate that the proposed prediction algorithm is capable of performing real-time, low-latency event prediction.
https://arxiv.org/abs/2403.08086
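A toy, non-asynchronous illustration of the prediction idea: events that the receiver could predict by translating earlier events along the current flow estimate are not transmitted. The matching rule and rounding are assumptions, not the paper's compression scheme.

```python
def compress_events(events, flow):
    """Skip events that are predictable from earlier events shifted along the flow.

    events: iterable of (x, y, t, p) tuples; flow: (H, W, 2) pixel displacement field
    over the prediction horizon. Returns the transmitted events and the resulting
    compression ratio (illustrative only)."""
    transmitted = []
    predicted = set()
    events = list(events)
    for x, y, t, p in events:
        if (round(x), round(y), p) in predicted:
            continue                                     # receiver can predict this event
        transmitted.append((x, y, t, p))
        u, v = flow[int(y), int(x)]
        predicted.add((round(x + u), round(y + v), p))   # expected location after the horizon
    ratio = len(events) / max(len(transmitted), 1)
    return transmitted, ratio
```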
3D object detection is one of the most important components in any Self-Driving stack, but current state-of-the-art (SOTA) lidar object detectors require costly & slow manual annotation of 3D bounding boxes to perform well. Recently, several methods emerged to generate pseudo ground truth without human supervision, however, all of these methods have various drawbacks: Some methods require sensor rigs with full camera coverage and accurate calibration, partly supplemented by an auxiliary optical flow engine. Others require expensive high-precision localization to find objects that disappeared over multiple drives. We introduce a novel self-supervised method to train SOTA lidar object detection networks which works on unlabeled sequences of lidar point clouds only, which we call trajectory-regularized self-training. It utilizes a SOTA self-supervised lidar scene flow network under the hood to generate, track, and iteratively refine pseudo ground truth. We demonstrate the effectiveness of our approach for multiple SOTA object detection networks across multiple real-world datasets. Code will be released.
https://arxiv.org/abs/2403.07071
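A high-level sketch of the trajectory-regularized self-training loop described above; `track`, `refine`, and `train` are placeholders, so this only illustrates the generate-track-refine-retrain control flow, not the actual pipeline.

```python
def trajectory_regularized_self_training(detector, scene_flow_net, sequences,
                                         track, refine, train, n_rounds=3):
    """Self-supervised scene flow proposes moving objects on unlabeled lidar sequences,
    tracking links them into trajectories, the trajectories regularize the pseudo boxes,
    and the detector is retrained on the refined pseudo ground truth (sketch only)."""
    pseudo_labels = {}
    for _ in range(n_rounds):
        for seq in sequences:
            flows = [scene_flow_net(f0, f1) for f0, f1 in zip(seq[:-1], seq[1:])]
            tracks = track(seq, flows, pseudo_labels.get(id(seq)))
            pseudo_labels[id(seq)] = refine(tracks)      # trajectory-consistent boxes
        detector = train(detector, sequences, pseudo_labels)
    return detector, pseudo_labels
```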
To enhance localization accuracy in urban environments, an innovative LiDAR-Visual-Inertial odometry, named HDA-LVIO, is proposed by employing hybrid data association. The proposed HDA-LVIO system can be divided into two subsystems: the LiDAR-Inertial subsystem (LIS) and the Visual-Inertial subsystem (VIS). In the LIS, the LiDAR pointcloud is utilized to calculate the Iterative Closest Point (ICP) error, serving as the measurement value of an Error State Iterated Kalman Filter (ESIKF) to construct the global map. In the VIS, an incremental method is first employed to adaptively extract planes from the global map. The centroids of these planes are projected onto the image to obtain projection points. Then, feature points are extracted from the image and tracked along with the projection points using Lucas-Kanade (LK) optical flow. Next, leveraging the vehicle states from previous intervals, sliding window optimization is performed to estimate the depth of the feature points. Concurrently, a method based on epipolar geometric constraints is proposed to address tracking failures for feature points, which improves the accuracy of depth estimation by ensuring sufficient parallax within the sliding window. Subsequently, the feature points and projection points are hybridly associated to construct the reprojection error, serving as the measurement value of the ESIKF to estimate vehicle states. Finally, the localization accuracy of the proposed HDA-LVIO is validated using public datasets and data from our equipment. The results demonstrate that the proposed algorithm achieves a clear improvement in localization accuracy compared to various existing algorithms.
https://arxiv.org/abs/2403.06590
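A small OpenCV sketch of the LK tracking step mentioned in the VIS: projected plane centroids and image feature points are tracked from the previous frame to the current one; the window size and pyramid levels are typical defaults, not the paper's settings.

```python
import cv2
import numpy as np

def track_points_lk(prev_gray, curr_gray, prev_pts):
    """Track points across two grayscale images with Lucas-Kanade optical flow.

    prev_pts: (N, 2) float array of pixel coordinates (feature points and projected
    plane centroids). Returns only the successfully tracked correspondences."""
    p0 = prev_pts.reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, p0, None,
        winSize=(21, 21), maxLevel=3)
    ok = status.reshape(-1) == 1
    return prev_pts[ok], p1.reshape(-1, 2)[ok]
```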