We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and run global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at this http URL
https://arxiv.org/abs/2404.15263
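For intuition only: the paper couples flow prediction with a differentiable wide-baseline two-view pose solver, but the underlying two-view geometry can be illustrated with a classical (non-differentiable) essential-matrix solve over flow-derived correspondences. The sketch below assumes OpenCV, a dense forward flow field, and known intrinsics; it is not the authors' solver.

```python
# Hypothetical illustration: recover relative camera pose from optical-flow
# correspondences with a classical (non-differentiable) solver. The paper's
# solver is differentiable and trained end-to-end; this only shows the
# underlying two-view geometry.
import numpy as np
import cv2

def pose_from_flow(flow, K, stride=8):
    """flow: (H, W, 2) forward optical flow; K: (3, 3) camera intrinsics."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h:stride, 0:w:stride]
    pts1 = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)
    pts2 = pts1 + flow[ys, xs].reshape(-1, 2)            # flow-induced matches
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # rotation and unit-norm translation direction
```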
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).
https://arxiv.org/abs/2404.15259
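A minimal NumPy sketch of the kind of least-squares objective FlowMap describes: the flow induced by depth, intrinsics, and a relative pose is compared against an off-the-shelf flow field. The pinhole projection and mean-squared form are assumptions for illustration; the actual objective, re-parameterizations, and point-track terms are in the paper.

```python
# Minimal sketch (not the authors' code): the flow induced by depth, intrinsics,
# and relative pose, compared against a precomputed optical-flow field.
import numpy as np

def induced_flow_residual(depth, K, R, t, target_flow):
    """depth: (H, W); K: (3, 3); R, t: relative pose frame1->frame2;
    target_flow: (H, W, 2) off-the-shelf flow used as supervision."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, N)
    rays = np.linalg.inv(K) @ pix                      # back-project to rays
    pts = rays * depth.reshape(1, -1)                  # 3D points in frame 1
    pts2 = R @ pts + t.reshape(3, 1)                   # transform into frame 2
    proj = K @ pts2
    uv2 = (proj[:2] / proj[2:]).T.reshape(h, w, 2)     # reproject to pixels
    induced = uv2 - np.stack([xs, ys], axis=-1)        # induced optical flow
    return np.mean((induced - target_flow) ** 2)       # least-squares objective
```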
Given a source portrait, the automatic human body reshaping task aims at editing it to an aesthetic body shape. As the technology has been widely used in media, several methods have been proposed, mainly focusing on generating optical flow to warp the body shape. However, those previous works only consider the local transformation of different body parts (arms, torso, and legs), ignoring the global affinity and limiting the capacity to ensure consistency and quality across the entire body. In this paper, we propose a novel Adaptive Affinity-Graph Network (AAGN), which extracts the global affinity between different body parts to enhance the quality of the generated optical flow. Specifically, our AAGN primarily introduces the following designs: (1) we propose an Adaptive Affinity-Graph (AAG) Block that leverages the characteristics of a fully connected graph. AAG represents different body parts as nodes in an adaptive fully connected graph and captures all the affinities between nodes to obtain a global affinity map. This design better improves the consistency between body parts. (2) Besides, since high-frequency details are crucial for photo aesthetics, a Body Shape Discriminator (BSD) is designed to extract information from both the high-frequency and spatial domains. In particular, an SRM filter is utilized to extract high-frequency details, which are combined with spatial features as input to the BSD. With this design, the BSD guides the Flow Generator (FG) to pay attention to various fine details rather than rigid pixel-level fitting. Extensive experiments conducted on the BR-5K dataset demonstrate that our framework significantly enhances the aesthetic appeal of reshaped photos, marginally surpassing all previous work to achieve state-of-the-art results in all evaluation metrics.
https://arxiv.org/abs/2404.13983
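As a rough illustration of the global-affinity idea (not the paper's exact AAG block), one could compute a dense affinity map over pooled body-part features with a learned dot-product attention; the module name and its form below are assumptions.

```python
# Hypothetical sketch of a global affinity map over body-part nodes; the actual
# AAG block may differ (this uses a simple learned dot-product affinity).
import torch
import torch.nn as nn

class AffinityGraph(nn.Module):
    def __init__(self, num_parts=3, dim=256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, part_feats):
        # part_feats: (B, num_parts, dim) pooled features for arms, torso, legs
        q, k = self.query(part_feats), self.key(part_feats)
        affinity = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        # affinity: (B, num_parts, num_parts), one weight per pair of parts
        return affinity @ part_feats  # parts updated with globally mixed context
```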
Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To address this problem, the proposed approach aims to bridge this gap by introducing adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples cause up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on the average end-point error. Analyzing the success and failure of these attacks on scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks.
https://arxiv.org/abs/2404.13621
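A hedged sketch of what such a white-box attack could look like: a PGD-style perturbation of the input point cloud that maximizes average end-point error. The model interface, perturbation budget, and step schedule are illustrative assumptions, not the paper's exact attack.

```python
# Hedged sketch of a white-box attack on a scene-flow network: perturb the
# input point cloud to maximize average end-point error (EPE). `model`,
# shapes, and the L-infinity budget are illustrative assumptions.
import torch

def attack_scene_flow(model, pc1, pc2, gt_flow, eps=0.02, steps=10):
    """pc1, pc2: (B, N, 3) point clouds; gt_flow: (B, N, 3) ground-truth flow."""
    delta = torch.zeros_like(pc1, requires_grad=True)
    for _ in range(steps):
        pred = model(pc1 + delta, pc2)                    # predicted scene flow
        epe = torch.norm(pred - gt_flow, dim=-1).mean()   # average end-point error
        epe.backward()
        with torch.no_grad():
            delta += (eps / steps) * delta.grad.sign()    # gradient ascent step
            delta.clamp_(-eps, eps)                       # stay within budget
        delta.grad.zero_()
    return (pc1 + delta).detach()
```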
Tackling image degradation due to atmospheric turbulence, particularly in dynamic environments, remains a challenge for long-range imaging systems. Existing techniques have been primarily designed for static scenes or scenes with small motion. This paper presents the first segment-then-restore pipeline for restoring videos of dynamic scenes in turbulent environments. We leverage mean optical flow with an unsupervised motion segmentation method to separate dynamic and static scene components prior to restoration. After camera shake compensation and segmentation, we introduce foreground/background enhancement leveraging the statistics of turbulence strength and a transformer model trained on a novel noise-based procedural turbulence generator for fast dataset augmentation. Benchmarked against existing restoration methods, our approach restores most of the geometric distortion and enhances the sharpness of the videos. We make our code, simulator, and data publicly available to advance the field of video restoration from turbulence: this http URL
https://arxiv.org/abs/2404.13605
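For intuition, a heavily simplified stand-in for the mean-flow segmentation front end: average the per-frame flow over a clip and threshold its magnitude to obtain a coarse dynamic/static mask. The paper's unsupervised segmentation is more elaborate; the adaptive threshold below is an assumption.

```python
# Simplified, hypothetical stand-in for the segment-then-restore front end:
# average the optical flow over a clip and threshold its magnitude to get a
# coarse dynamic/static mask. The paper's unsupervised segmentation is richer.
import numpy as np

def dynamic_mask(flows, k=2.0):
    """flows: (T, H, W, 2) per-frame optical flow after camera-shake compensation."""
    mean_flow = flows.mean(axis=0)                       # temporal mean flow
    mag = np.linalg.norm(mean_flow, axis=-1)             # per-pixel magnitude
    thresh = mag.mean() + k * mag.std()                  # adaptive threshold
    return mag > thresh                                  # True where scene is dynamic
```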
In this paper, we present RStab, a novel framework for video stabilization that integrates 3D multi-frame fusion through volume rendering. Departing from conventional methods, we introduce a 3D multi-frame perspective to generate stabilized images, addressing the challenge of full-frame generation while preserving structure. The core of our RStab framework lies in Stabilized Rendering (SR), a volume rendering module that fuses multi-frame information in 3D space and extends beyond image fusion by incorporating feature fusion. Specifically, SR involves warping features and colors from multiple frames by projection and fusing them into descriptors to render the stabilized image. However, the precision of the warped information depends on the projection accuracy, a factor significantly influenced by dynamic regions. In response, we introduce the Adaptive Ray Range (ARR) module, which integrates depth priors to adaptively define the sampling range for the projection process. Additionally, we propose Color Correction (CC), which assists the geometric constraints with optical flow for accurate color aggregation. Thanks to these three modules, our RStab demonstrates superior performance compared with previous stabilizers in terms of field of view (FOV), image quality, and video stability across various datasets.
https://arxiv.org/abs/2404.12887
The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much-studied area with numerous careful, and sometimes complex, approaches and training schemes, including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single- and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.
https://arxiv.org/abs/2404.12389
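A hedged sketch of the second variant (RGB input, flow as prompt) using the public segment_anything API: a point prompt is taken at the flow-magnitude peak and passed to SAM. The checkpoint path and the peak-picking heuristic are assumptions; the paper's prompting strategy may differ.

```python
# Hedged sketch of the "flow as prompt" variant: pick a point prompt at the
# peak of the flow magnitude and pass it to SAM on the RGB frame. The
# checkpoint path and the peak-picking heuristic are illustrative assumptions.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed path
predictor = SamPredictor(sam)

def segment_moving_object(rgb, flow):
    """rgb: (H, W, 3) uint8 frame; flow: (H, W, 2) optical flow to the next frame."""
    mag = np.linalg.norm(flow, axis=-1)
    y, x = np.unravel_index(np.argmax(mag), mag.shape)   # strongest-motion pixel
    predictor.set_image(rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]), point_labels=np.array([1]),
        multimask_output=True)
    return masks[np.argmax(scores)]                       # best-scoring mask
```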
Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. The effectiveness of BEV encoders therefore crucially depends on the operators used to aggregate temporal information and on the latent representation spaces in which they operate. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We consider subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement of TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation uncovers a strong synergy of joint temporal aggregation in the image and BEV latent space. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
https://arxiv.org/abs/2404.11803
This work addresses the landing problem of an aerial vehicle, exemplified by a simple quadrotor, on a moving platform using image-based visual servo control. First, the mathematical model of the quadrotor aircraft is introduced, followed by the design of the inner-loop control. At the second stage, the image features on the textured target plane are exploited to derive a vision-based control law. The image of the spherical centroid of a set of landmarks present in the landing target is used as a position measurement, whereas the translational optical flow is used as a velocity measurement. The kinematics of the vision-based system is expressed in terms of the observable features, and the proposed control law guarantees convergence without estimating the unknown distance between the vision system and the target, which is also guaranteed to remain strictly positive, avoiding undesired collisions. The performance of the proposed control law is evaluated in MATLAB and the 3-D simulation software Gazebo. Simulation results for a quadrotor UAV are provided for different velocity profiles of the moving target, showcasing the robustness of the proposed controller.
https://arxiv.org/abs/2404.11336
In this paper, we address the Bracket Image Restoration and Enhancement (BracketIRE) task using a novel framework, which requires restoring a high-quality high dynamic range (HDR) image from a sequence of noisy, blurred, and low dynamic range (LDR) multi-exposure RAW inputs. To overcome this challenge, we present the IREANet, which improves the multiple exposure alignment and aggregation with a Flow-guide Feature Alignment Module (FFAM) and an Enhanced Feature Aggregation Module (EFAM). Specifically, the proposed FFAM incorporates the inter-frame optical flow as guidance to facilitate the deformable alignment and spatial attention modules for better feature alignment. The EFAM further employs the proposed Enhanced Residual Block (ERB) as a foundational component, wherein a unidirectional recurrent network aggregates the aligned temporal features to better reconstruct the results. To improve model generalization and performance, we additionally employ the Bayer preserving augmentation (BayerAug) strategy to augment the multi-exposure RAW inputs. Our experimental evaluations demonstrate that the proposed IREANet shows state-of-the-art performance compared with previous methods.
https://arxiv.org/abs/2404.10358
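The basic step behind flow-guided alignment can be sketched as bilinear warping of neighboring-frame features along the inter-frame flow; the FFAM additionally applies deformable alignment and spatial attention on top of this. A minimal PyTorch sketch, with shapes as assumptions:

```python
# Minimal sketch of flow-guided feature warping with bilinear sampling; the
# FFAM additionally applies deformable alignment and spatial attention.
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """feat: (B, C, H, W) features of a neighboring frame;
    flow: (B, 2, H, W) optical flow mapping reference pixels to that frame."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(feat.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                              # sample positions
    # normalize to [-1, 1] for grid_sample, x first then y
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack([coords_x, coords_y], dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat, norm_grid, align_corners=True)
```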
Spin plays a pivotal role in ball-based sports. Estimating spin becomes a key skill due to its impact on the ball's trajectory and bouncing behavior. Spin cannot be observed directly, making it inherently challenging to estimate. In table tennis, the combination of high velocity and spin renders traditional low frame rate cameras inadequate for quickly and accurately observing the ball's logo to estimate the spin due to the motion blur. Event cameras do not suffer as much from motion blur, thanks to their high temporal resolution. Moreover, the sparse nature of the event stream solves communication bandwidth limitations many frame cameras face. To the best of our knowledge, we present the first method for table tennis spin estimation using an event camera. We use ordinal time surfaces to track the ball and then isolate the events generated by the logo on the ball. Optical flow is then estimated from the extracted events to infer the ball's spin. We achieved a spin magnitude mean error of $10.7 \pm 17.3$ rps and a spin axis mean error of $32.9 \pm 38.2^\circ$ in real time for a flying ball.
https://arxiv.org/abs/2404.09870
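A deliberately simplified, hypothetical reading of the last step: once logo events are isolated and their optical flow estimated, an in-plane spin rate can be read off from the tangential speed of logo pixels around the ball center. The real method recovers a 3D spin axis; this sketch assumes pure in-plane rotation, a known ball center, and flow expressed in pixels per second.

```python
# Hypothetical simplification: infer spin rate from the tangential speed of
# logo pixels around the ball center. Real spin-axis estimation is 3D; this
# treats the visible motion as in-plane rotation only.
import numpy as np

def spin_rate_rps(points, flows, center):
    """points, flows: (N, 2) logo-pixel positions and their optical flow (px/s);
    center: (2,) ball center in pixels."""
    r = points - center                                   # radial vectors
    tangent = np.stack([-r[:, 1], r[:, 0]], axis=-1)
    tangent /= np.linalg.norm(tangent, axis=-1, keepdims=True) + 1e-9
    v_t = np.sum(flows * tangent, axis=-1)                # tangential speed (px/s)
    omega = v_t / (np.linalg.norm(r, axis=-1) + 1e-9)     # rad/s per point
    return np.median(omega) / (2 * np.pi)                 # revolutions per second
```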
The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned on keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorizes appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support the generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state of the art in terms of motion transfer quality and temporal consistency.
https://arxiv.org/abs/2404.09736
Recently, event-based vision sensors have gained attention for autonomous driving applications, as conventional RGB cameras face limitations in handling challenging dynamic conditions. However, the availability of real-world and synthetic event-based vision datasets remains limited. In response to this gap, we present SEVD, a first-of-its-kind multi-view ego and fixed perception synthetic event-based dataset, recorded using multiple dynamic vision sensors within the CARLA simulator. Data sequences are recorded across diverse lighting (noon, nighttime, twilight) and weather conditions (clear, cloudy, wet, rainy, foggy) with domain shifts (discrete and continuous). SEVD spans urban, suburban, rural, and highway scenes featuring various classes of objects (car, truck, van, bicycle, motorcycle, and pedestrian). Alongside event data, SEVD includes RGB imagery, depth maps, optical flow, and semantic and instance segmentation, facilitating a comprehensive understanding of the scene. Furthermore, we evaluate the dataset using state-of-the-art event-based (RED, RVT) and frame-based (YOLOv8) methods for traffic participant detection tasks and provide baseline benchmarks for assessment. Additionally, we conduct experiments to assess the synthetic event-based dataset's generalization capabilities. The dataset is available at this https URL
https://arxiv.org/abs/2404.10540
Optical flow estimation is crucial to a variety of vision tasks. Despite substantial recent advancements, achieving real-time on-device optical flow estimation remains a complex challenge. First, an optical flow model must be sufficiently lightweight to meet computation and memory constraints to ensure real-time performance on devices. Second, the necessity for real-time on-device operation imposes constraints that weaken the model's capacity to adequately handle ambiguities in flow estimation, thereby intensifying the difficulty of preserving flow accuracy. This paper introduces two synergistic techniques, Self-Cleaning Iteration (SCI) and Regression Focal Loss (RFL), designed to enhance the capabilities of optical flow models, with a focus on addressing optical flow regression ambiguities. These techniques prove particularly effective in mitigating error propagation, a prevalent issue in optical flow models that employ iterative refinement. Notably, these techniques add negligible to zero overhead in model parameters and inference latency, thereby preserving real-time on-device efficiency. The effectiveness of our proposed SCI and RFL techniques, collectively referred to as SciFlow for brevity, is demonstrated across two distinct lightweight optical flow model architectures in our experiments. Remarkably, SciFlow enables substantial reduction in error metrics (EPE and Fl-all) over the baseline models by up to 6.3% and 10.5% for in-domain scenarios and by up to 6.2% and 13.5% for cross-domain scenarios on the Sintel and KITTI 2015 datasets, respectively.
https://arxiv.org/abs/2404.08135
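The abstract does not give the exact form of the Regression Focal Loss, so the following is only an assumed focal-style weighting of a per-pixel end-point-error loss, up-weighting hard pixels, to illustrate the idea.

```python
# Assumed focal-style weighting for flow regression (the paper's exact RFL
# formulation may differ): up-weight pixels with large end-point error.
import torch

def regression_focal_loss(pred_flow, gt_flow, gamma=1.0, eps=1e-6):
    """pred_flow, gt_flow: (B, 2, H, W)."""
    epe = torch.norm(pred_flow - gt_flow, dim=1)          # (B, H, W) per-pixel error
    weight = (epe / (epe.mean(dim=(1, 2), keepdim=True) + eps)) ** gamma
    return (weight.detach() * epe).mean()                 # focus loss on hard pixels
```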
Heart rate is an important physiological indicator of human health status. Existing remote heart rate measurement methods typically involve facial detection followed by signal extraction from the region of interest (ROI). These SOTA methods have three serious problems: (a) inaccuracies or even failures in detection caused by environmental influences or subject movement; (b) failures for special patients such as infants and burn victims; (c) privacy leakage issues resulting from collecting face video. To address these issues, we regard remote heart rate measurement as the process of analyzing the spatiotemporal characteristics of the optical flow signal in the video. We apply chaos theory to computer vision tasks for the first time, thus designing a brain-inspired framework. First, an artificial primary visual cortex model is used to extract the skin in the videos, and heart rate is then calculated by time-frequency analysis on all pixels. Our method achieves Robust Skin Tracking for Heart Rate measurement, called HR-RST. The experimental results show that HR-RST overcomes the difficulty of environmental influences and effectively tracks the subject's movement. Moreover, the method can extend to other body parts. Consequently, it can be applied to special patients and effectively protect individual privacy, offering an innovative solution.
https://arxiv.org/abs/2404.07687
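A simplified sketch of the time-frequency step: given the tracked skin pixels, take their mean intensity per frame and pick the dominant frequency in a plausible heart-rate band. The band limits and the use of a spatial mean are assumptions, not the paper's exact analysis.

```python
# Simplified sketch of the time-frequency step: take the mean skin-pixel
# intensity over time and pick the dominant frequency in the 0.7-3 Hz band
# (42-180 bpm). Band limits and the spatial mean are assumptions.
import numpy as np

def heart_rate_bpm(skin_signal, fps):
    """skin_signal: (T,) mean intensity of tracked skin pixels per frame."""
    sig = skin_signal - skin_signal.mean()
    spectrum = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 3.0)                # plausible heart-rate band
    peak = freqs[band][np.argmax(spectrum[band])]
    return peak * 60.0                                    # beats per minute
```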
Optical flow is a classical task that is important to the vision community. Classical optical flow estimation uses two frames as input, whilst some recent methods consider multiple frames to explicitly model long-range information. The former limits the ability to fully leverage temporal coherence along the video sequence, and the latter incurs heavy computational overhead, typically precluding real-time flow estimation. Some multi-frame-based approaches even necessitate unseen future frames for current estimation, compromising real-time applicability in safety-critical scenarios. To this end, we present MemFlow, a real-time method for optical flow estimation and prediction with memory. Our method enables memory read-out and update modules for aggregating historical motion information in real time. Furthermore, we integrate resolution-adaptive re-scaling to accommodate diverse video resolutions. Besides, our approach seamlessly extends to the future prediction of optical flow based on past observations. Leveraging effective historical motion aggregation, our method outperforms VideoFlow with fewer parameters and faster inference speed on the Sintel and KITTI-15 datasets in terms of generalization performance. At the time of submission, MemFlow also leads in performance on the 1080p Spring dataset. Codes and models will be available at: this https URL.
https://arxiv.org/abs/2404.04808
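The abstract only names memory read-out and update modules; as a hedged sketch, a read-out could be a cross-attention over stored motion features and an update a bounded append, as below. Both designs are assumptions for illustration.

```python
# Hedged sketch of attention-based memory read-out and bounded update for
# motion features; the actual MemFlow modules may be structured differently.
import torch

def memory_readout(query_feat, mem_keys, mem_values):
    """query_feat: (B, N, C) current-frame motion features;
    mem_keys, mem_values: (B, M, C) historical features stored in memory."""
    attn = torch.softmax(
        query_feat @ mem_keys.transpose(1, 2) / query_feat.shape[-1] ** 0.5, dim=-1)
    read = attn @ mem_values                               # aggregate history
    return query_feat + read                               # fuse with current features

def memory_update(mem_keys, mem_values, new_key, new_value, max_size=16):
    """Append the newest frame's features and drop the oldest beyond max_size."""
    mem_keys = torch.cat([mem_keys, new_key], dim=1)[:, -max_size:]
    mem_values = torch.cat([mem_values, new_value], dim=1)[:, -max_size:]
    return mem_keys, mem_values
```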
Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision, offering a balance between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training for enhancing optical flow learning from pose-only labels, and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels during training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.
https://arxiv.org/abs/2404.04677
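The homographic pre-training idea can be sketched directly: warping an image by a random homography yields an exact dense flow field for free, which can supervise flow learning without labels. The perturbation range below is an arbitrary assumption.

```python
# Hedged sketch of homographic pre-training data: warp an image by a random
# homography and derive the exact dense flow it induces, giving flow
# supervision without labels. Perturbation magnitudes are arbitrary choices.
import numpy as np
import cv2

def homography_pair(img, max_shift=32):
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.random.uniform(-max_shift, max_shift, src.shape).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(img, H, (w, h))
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    pts = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2)    # (N, 1, 2) pixel coords
    mapped = cv2.perspectiveTransform(pts, H).reshape(h, w, 2)
    gt_flow = mapped - np.stack([xs, ys], axis=-1)         # exact flow img -> warped
    return warped, gt_flow
```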
Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2 (ViT-g) and leveraging them beyond head-only transfer learning.
https://arxiv.org/abs/2404.01282
Pixel-wise regression tasks (e.g., monocular depth estimation (MDE) and optical flow estimation (OFE)) have been widely involved in our daily life in applications like autonomous driving, augmented reality and video composition. Although certain applications are security-critical or bear societal significance, the adversarial robustness of such models is not sufficiently studied, especially in the black-box scenario. In this work, we introduce the first unified black-box adversarial patch attack framework against pixel-wise regression tasks, aiming to identify the vulnerabilities of these models under query-based black-box attacks. We propose a novel square-based adversarial patch optimization framework and employ probabilistic square sampling and score-based gradient estimation techniques to generate the patch effectively and efficiently, overcoming the scalability problem of previous black-box patch attacks. Our attack prototype, named BadPart, is evaluated on both MDE and OFE tasks, utilizing a total of 7 models. BadPart surpasses 3 baseline methods in terms of both attack performance and efficiency. We also apply BadPart to the Google online service for portrait depth estimation, causing a 43.5% relative distance error with 50K queries. State-of-the-art (SOTA) countermeasures cannot defend against our attack effectively.
https://arxiv.org/abs/2404.00924
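A heavily simplified, hypothetical sketch of the square-based, query-only patch optimization: sample random squares inside the allowed patch region and keep a change only if the queried error score increases. The score-based gradient estimation used in the paper is omitted here, and `query_error` is a placeholder for a black-box call.

```python
# Heavily simplified, hypothetical sketch of a query-based square patch attack:
# sample a random square inside the patch region, randomize its content, and
# keep the change only if the model's error (the "score") increases.
# `query_error` stands in for a black-box call returning, e.g., mean EPE.
import numpy as np

def square_patch_attack(query_error, image, patch_box, iters=1000, max_side=16):
    """image: (H, W, 3) uint8; patch_box: (x0, y0, x1, y1) perturbable region."""
    x0, y0, x1, y1 = patch_box
    adv = image.copy()
    best = query_error(adv)
    for _ in range(iters):
        side = np.random.randint(2, max_side)
        px = np.random.randint(x0, x1 - side)
        py = np.random.randint(y0, y1 - side)
        cand = adv.copy()
        cand[py:py + side, px:px + side] = np.random.randint(
            0, 256, size=(side, side, 3), dtype=image.dtype)
        score = query_error(cand)                  # one black-box query
        if score > best:                           # larger error = better attack
            adv, best = cand, score
    return adv
```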
Self-supervised multi-frame methods have currently achieved promising results in depth estimation. However, these methods often suffer from mismatch problems due to moving objects, which break the static-scene assumption. Additionally, unfairness can occur when calculating photometric errors in high-frequency or low-texture regions of the images. To address these issues, existing approaches use additional semantic-prior black-box networks to separate moving objects and improve the model only at the loss level. Therefore, we propose FlowDepth, where a Dynamic Motion Flow Module (DMFM) decouples the optical flow by a mechanism-based approach and warps the dynamic regions, thus solving the mismatch problem. To address the unfairness of photometric errors caused by high-frequency and low-texture regions, we use Depth-Cue-Aware Blur (DCABlur) and a Cost-Volume sparsity loss at the input and loss levels, respectively. Experimental results on the KITTI and Cityscapes datasets show that our method outperforms the state-of-the-art methods.
https://arxiv.org/abs/2403.19294