Spin plays a pivotal role in ball-based sports. Estimating spin becomes a key skill due to its impact on the ball's trajectory and bouncing behavior. Spin cannot be observed directly, making it inherently challenging to estimate. In table tennis, the combination of high velocity and spin renders traditional low-frame-rate cameras inadequate for quickly and accurately observing the ball's logo to estimate the spin, due to motion blur. Event cameras do not suffer as much from motion blur, thanks to their high temporal resolution. Moreover, the sparse nature of the event stream addresses the communication bandwidth limitations many frame cameras face. To the best of our knowledge, we present the first method for table tennis spin estimation using an event camera. We use ordinal time surfaces to track the ball and then isolate the events generated by the logo on the ball. Optical flow is then estimated from the extracted events to infer the ball's spin. We achieved a spin magnitude mean error of $10.7 \pm 17.3$ rps and a spin axis mean error of $32.9 \pm 38.2^\circ$ in real time for a flying ball.
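A minimal sketch of the final inference step, under the assumption that the logo's image flow has already been back-projected to 3D surface velocities at ball-centred surface points (function and variable names are illustrative, not from the paper): the angular velocity $\omega$ satisfies $v = \omega \times r$ and can be recovered by linear least squares.

```python
import numpy as np

def estimate_spin(points, velocities):
    """Recover the angular velocity w (rad/s) from ball-centred surface points r (metres)
    and their linear velocities v (m/s), solving v = w x r in least squares."""
    rows, rhs = [], []
    for (x, y, z), v in zip(points, velocities):
        # v = w x r  <=>  [[0, z, -y], [-z, 0, x], [y, -x, 0]] @ w = v
        rows.append([[0.0, z, -y],
                     [-z, 0.0, x],
                     [y, -x, 0.0]])
        rhs.append(v)
    A = np.concatenate(rows, axis=0)
    b = np.concatenate(rhs, axis=0)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    magnitude_rps = np.linalg.norm(w) / (2 * np.pi)   # revolutions per second
    axis = w / (np.linalg.norm(w) + 1e-12)            # unit spin axis
    return magnitude_rps, axis

# toy check: a ball of radius 20 mm spinning at 50 rps about the z-axis
rng = np.random.default_rng(0)
true_w = np.array([0.0, 0.0, 50 * 2 * np.pi])
pts = rng.normal(size=(200, 3))
pts = 0.02 * pts / np.linalg.norm(pts, axis=1, keepdims=True)
vels = np.cross(np.tile(true_w, (200, 1)), pts)
print(estimate_spin(pts, vels))   # ~ (50.0, [0, 0, 1])
```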
https://arxiv.org/abs/2404.09870
The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned on keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner and factorize appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state of the art in terms of motion transfer quality and temporal consistency.
https://arxiv.org/abs/2404.09736
Recently, event-based vision sensors have gained attention for autonomous driving applications, as conventional RGB cameras face limitations in handling challenging dynamic conditions. However, the availability of real-world and synthetic event-based vision datasets remains limited. In response to this gap, we present SEVD, a first-of-its-kind multi-view ego- and fixed-perception synthetic event-based dataset recorded with multiple dynamic vision sensors within the CARLA simulator. Data sequences are recorded across diverse lighting (noon, nighttime, twilight) and weather conditions (clear, cloudy, wet, rainy, foggy) with domain shifts (discrete and continuous). SEVD spans urban, suburban, rural, and highway scenes featuring various classes of objects (car, truck, van, bicycle, motorcycle, and pedestrian). Alongside event data, SEVD includes RGB imagery, depth maps, optical flow, and semantic and instance segmentation, facilitating a comprehensive understanding of the scene. Furthermore, we evaluate the dataset using state-of-the-art event-based (RED, RVT) and frame-based (YOLOv8) methods for traffic participant detection tasks and provide baseline benchmarks for assessment. Additionally, we conduct experiments to assess the synthetic event-based dataset's generalization capabilities. The dataset is available at this https URL
https://arxiv.org/abs/2404.10540
Optical flow estimation is crucial to a variety of vision tasks. Despite substantial recent advancements, achieving real-time on-device optical flow estimation remains a complex challenge. First, an optical flow model must be sufficiently lightweight to meet computation and memory constraints to ensure real-time performance on devices. Second, the necessity for real-time on-device operation imposes constraints that weaken the model's capacity to adequately handle ambiguities in flow estimation, thereby intensifying the difficulty of preserving flow accuracy. This paper introduces two synergistic techniques, Self-Cleaning Iteration (SCI) and Regression Focal Loss (RFL), designed to enhance the capabilities of optical flow models, with a focus on addressing optical flow regression ambiguities. These techniques prove particularly effective in mitigating error propagation, a prevalent issue in optical flow models that employ iterative refinement. Notably, these techniques add negligible to zero overhead in model parameters and inference latency, thereby preserving real-time on-device efficiency. The effectiveness of our proposed SCI and RFL techniques, collectively referred to as SciFlow for brevity, is demonstrated across two distinct lightweight optical flow model architectures in our experiments. Remarkably, SciFlow enables substantial reduction in error metrics (EPE and Fl-all) over the baseline models by up to 6.3% and 10.5% for in-domain scenarios and by up to 6.2% and 13.5% for cross-domain scenarios on the Sintel and KITTI 2015 datasets, respectively.
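The abstract does not give the RFL formulation; purely as a hedged illustration of the general idea (re-weighting per-pixel flow errors so that ambiguous, high-error pixels dominate the gradient), a hypothetical regression-style focal weighting might look like the following; it is not the paper's exact loss.

```python
import numpy as np

def regression_focal_loss(pred_flow, gt_flow, gamma=1.0, eps=1e-6):
    """Hypothetical focal-style weighting for flow regression (not the paper's exact RFL):
    compute the per-pixel endpoint error and re-weight it so that hard, large-error
    pixels contribute more to the loss than easy, already-accurate ones."""
    epe = np.linalg.norm(pred_flow - gt_flow, axis=-1)    # (H, W) endpoint error
    weight = (epe / (epe.max() + eps)) ** gamma           # ~0 for easy pixels, 1 for hardest
    return float((weight * epe).mean())
```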
https://arxiv.org/abs/2404.08135
Heart rate is an important physiological indicator of human health status. Existing remote heart rate measurement methods typically involve facial detection followed by signal extraction from the region of interest (ROI). These SOTA methods have three serious problems: (a) inaccurate or even failed detection caused by environmental influences or subject movement; (b) failure for special patients such as infants and burn victims; (c) privacy leakage issues resulting from collecting face video. To address these issues, we regard remote heart rate measurement as the process of analyzing the spatiotemporal characteristics of the optical flow signal in the video. We apply chaos theory to computer vision tasks for the first time, thus designing a brain-inspired framework. First, an artificial primary visual cortex model is used to extract the skin in the videos, and heart rate is then calculated by time-frequency analysis on all pixels. Our method achieves Robust Skin Tracking for Heart Rate measurement, called HR-RST. The experimental results show that HR-RST overcomes the difficulty of environmental influences and effectively tracks subject movement. Moreover, the method can be extended to other body parts. Consequently, it can be applied to special patients and effectively protect individual privacy, offering an innovative solution.
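A hedged sketch of the last stage only (time-frequency analysis over skin pixels; the chaos-theoretic, cortex-inspired skin extraction is out of scope here), with assumed array shapes and a simple FFT-based peak picker rather than the paper's exact analysis:

```python
import numpy as np

def heart_rate_bpm(pixel_series, fps, f_lo=0.7, f_hi=3.0):
    """Estimate heart rate from a (T, N) array of skin-pixel intensity traces sampled at `fps`.
    Averages the traces, removes the DC component, and picks the strongest spectral peak
    inside the cardiac band (0.7-3 Hz, i.e. roughly 42-180 bpm)."""
    signal = pixel_series.mean(axis=1)          # average over the N skin pixels
    signal = signal - signal.mean()             # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_hz
```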
https://arxiv.org/abs/2404.07687
Optical flow is a classical task that is important to the vision community. Classical optical flow estimation uses two frames as input, whilst some recent methods consider multiple frames to explicitly model long-range information. The former limits the ability to fully leverage temporal coherence along the video sequence, while the latter incurs heavy computational overhead that typically rules out real-time flow estimation. Some multi-frame-based approaches even necessitate unseen future frames for the current estimation, compromising real-time applicability in safety-critical scenarios. To this end, we present MemFlow, a real-time method for optical flow estimation and prediction with memory. Our method uses memory read-out and update modules to aggregate historical motion information in real time. Furthermore, we integrate resolution-adaptive re-scaling to accommodate diverse video resolutions. Besides, our approach seamlessly extends to future prediction of optical flow based on past observations. Leveraging effective historical motion aggregation, our method outperforms VideoFlow with fewer parameters and faster inference speed on the Sintel and KITTI-15 datasets in terms of generalization performance. At the time of submission, MemFlow also leads in performance on the 1080p Spring dataset. Codes and models will be available at: this https URL.
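The abstract does not detail the read-out and update modules; a toy, attention-style motion memory (an assumption for illustration, not MemFlow's learned design operating on dense feature maps) conveys the mechanism of bounded aggregation of past motion:

```python
import numpy as np

class MotionMemory:
    """Toy fixed-size memory of past motion features with softmax-attention read-out.
    Illustrative only; the real modules are learned and operate on dense feature maps."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.keys, self.values = [], []

    def update(self, key, value):
        """Append a (key, value) pair and drop the oldest entry once capacity is exceeded."""
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.capacity:     # keep memory bounded for real-time use
            self.keys.pop(0)
            self.values.pop(0)

    def read(self, query):
        """Aggregate stored motion features by softmax attention against `query` (D,)."""
        K = np.stack(self.keys)                # (M, D)
        V = np.stack(self.values)              # (M, D)
        scores = K @ query / np.sqrt(query.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                     # aggregated historical motion feature
```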
https://arxiv.org/abs/2404.04808
Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid VO framework that leverages pose-only supervision, offering a balanced solution between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training that enhances optical flow learning from pose-only labels, and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels during training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.
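One plausible way to realise homographic pre-training (a sketch under assumptions; the paper's exact scheme may differ, and OpenCV is assumed available): sample a random homography, warp an image with it, and use the analytically known dense flow as a free supervision signal for the flow network.

```python
import numpy as np
import cv2  # OpenCV, assumed available

def random_homography_pair(image, max_shift=32, seed=None):
    """Warp `image` (H, W, C) by a random homography and return (warped, gt_flow).
    The homography is defined by jittering the four corners; gt_flow is the exact
    dense flow from the original to the warped image, usable as a pseudo label."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(image, H, (w, h))
    # analytic flow: project every pixel through H and subtract its original position
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1) @ H.T
    gt_flow = pts[..., :2] / pts[..., 2:3] - np.stack([xs, ys], axis=-1)
    return warped, gt_flow
```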
https://arxiv.org/abs/2404.04677
Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2 (ViT-g) and leveraging them beyond head-only transfer learning.
https://arxiv.org/abs/2404.01282
Pixel-wise regression tasks (e.g., monocular depth estimation (MDE) and optical flow estimation (OFE)) have been widely involved in our daily life in applications like autonomous driving, augmented reality and video composition. Although certain applications are security-critical or bear societal significance, the adversarial robustness of such models is not sufficiently studied, especially in the black-box scenario. In this work, we introduce the first unified black-box adversarial patch attack framework against pixel-wise regression tasks, aiming to identify the vulnerabilities of these models under query-based black-box attacks. We propose a novel square-based adversarial patch optimization framework and employ probabilistic square sampling and score-based gradient estimation techniques to generate the patch effectively and efficiently, overcoming the scalability problem of previous black-box patch attacks. Our attack prototype, named BadPart, is evaluated on both MDE and OFE tasks, utilizing a total of 7 models. BadPart surpasses 3 baseline methods in terms of both attack performance and efficiency. We also apply BadPart to the Google online service for portrait depth estimation, causing a 43.5% relative distance error with 50K queries. State-of-the-art (SOTA) countermeasures cannot defend against our attack effectively.
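A greatly simplified, hypothetical query loop in the spirit of square-sampling black-box attacks (not BadPart's actual algorithm; `score_fn` stands in for the black-box model's scalar attack objective and is an assumption):

```python
import numpy as np

def square_attack_step(patch, score_fn, square=8, step=0.05, rng=None):
    """One query-based update of an adversarial `patch` (H, W, C, values in [0, 1]).
    Samples a random square, estimates the objective gradient on that square with a
    two-sided finite difference (2 queries), and moves the patch in the better direction."""
    rng = rng or np.random.default_rng()
    h, w, c = patch.shape
    y = rng.integers(0, h - square + 1)
    x = rng.integers(0, w - square + 1)
    direction = rng.choice([-1.0, 1.0], size=(square, square, c))

    def perturbed(sign):
        p = patch.copy()
        p[y:y + square, x:x + square] = np.clip(
            p[y:y + square, x:x + square] + sign * step * direction, 0.0, 1.0)
        return p

    gain = score_fn(perturbed(+1.0)) - score_fn(perturbed(-1.0))
    return perturbed(np.sign(gain) if gain != 0 else +1.0)
```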
https://arxiv.org/abs/2404.00924
Self-supervised multi-frame methods have recently achieved promising results in depth estimation. However, these methods often suffer from mismatch problems caused by moving objects, which break the static-scene assumption. Additionally, unfairness can occur when calculating photometric errors in high-frequency or low-texture regions of the images. To address these issues, existing approaches use additional semantic-prior black-box networks to separate moving objects and improve the model only at the loss level. Therefore, we propose FlowDepth, where a Dynamic Motion Flow Module (DMFM) decouples the optical flow through a mechanism-based approach and warps the dynamic regions, thus solving the mismatch problem. For the unfairness of photometric errors caused by high-frequency and low-texture regions, we use Depth-Cue-Aware Blur (DCABlur) at the input level and a cost-volume sparsity loss at the loss level to solve the problem. Experimental results on the KITTI and Cityscapes datasets show that our method outperforms the state-of-the-art methods.
https://arxiv.org/abs/2403.19294
Deep learning-based video compression is a challenging task, and many previous state-of-the-art learning-based video codecs use optical flows to exploit the temporal correlation between successive frames and then compress the residual error. Although these two-stage models are end-to-end optimized, the epistemic uncertainty in the motion estimation and the aleatoric uncertainty from the quantization operation lead to errors in the intermediate representations and introduce artifacts in the reconstructed frames. This inherent flaw limits the potential for higher bit rate savings. To address this issue, we propose an uncertainty-aware video compression model that can effectively capture the predictive uncertainty with deep ensembles. Additionally, we introduce an ensemble-aware loss to encourage the diversity among ensemble members and investigate the benefits of incorporating adversarial training in the video compression task. Experimental results on 1080p sequences show that our model can effectively save bits by more than 20% compared to DVC Pro.
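Independently of the codec, the deep-ensemble uncertainty idea reduces to aggregating member predictions and reading their disagreement; a minimal sketch with assumed interfaces (each member is a callable returning arrays of identical shape):

```python
import numpy as np

def ensemble_prediction(members, x):
    """Aggregate an ensemble of prediction functions into a mean estimate and a
    per-pixel uncertainty map (variance across members, i.e. their disagreement).
    `members` is a list of callables that all return arrays of the same shape."""
    preds = np.stack([m(x) for m in members])   # (K, ...) stacked member outputs
    mean = preds.mean(axis=0)                   # ensemble prediction
    uncertainty = preds.var(axis=0)             # higher where members disagree
    return mean, uncertainty
```

An ensemble-aware diversity term, as mentioned in the abstract, would additionally penalise members for collapsing onto identical outputs during training.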
https://arxiv.org/abs/2403.19158
Self-supervised monocular depth estimation methods have been given increasing attention due to the benefit of not requiring large, labelled datasets. Such self-supervised methods require high-quality salient features and consequently suffer from severe performance drops for indoor scenes, where the low-textured regions that dominate the scenes are almost indiscriminative. To address the issue, we propose a self-supervised indoor monocular depth estimation framework called $\mathrm{F^2Depth}$. A self-supervised optical flow estimation network is introduced to supervise depth learning. To improve optical flow estimation performance in low-textured areas, only patches of points with more discriminative features are adopted for finetuning, based on our well-designed patch-based photometric loss. The finetuned optical flow estimation network generates high-accuracy optical flow as a supervisory signal for depth estimation. Correspondingly, an optical flow consistency loss is designed. Multi-scale feature maps produced by the finetuned optical flow estimation network are warped to compute a feature map synthesis loss as another supervisory signal for depth learning. Experimental results on the NYU Depth V2 dataset demonstrate the effectiveness of the framework and our proposed losses. To evaluate the generalization ability of our $\mathrm{F^2Depth}$, we collect a Campus Indoor depth dataset composed of approximately 1500 points selected from 99 images in 18 scenes. Zero-shot generalization experiments on the 7-Scenes dataset and Campus Indoor achieve $\delta_1$ accuracy of 75.8% and 76.0%, respectively. The accuracy results show that our model can generalize well to monocular images captured in unknown indoor scenes.
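A minimal sketch of a patch-based photometric loss (assumed here to be a plain L1 error restricted to discriminative patches; the paper's loss may differ, e.g. include SSIM or other terms):

```python
import numpy as np

def patch_photometric_loss(target, warped, patch_centers, half=3):
    """L1 photometric error between `target` and `warped` images (H, W, C, values in [0, 1]),
    evaluated only on square patches around the given (row, col) centres.
    Centres are assumed to lie at least `half` pixels away from the image border."""
    errors = []
    for r, c in patch_centers:
        t = target[r - half:r + half + 1, c - half:c + half + 1]
        w = warped[r - half:r + half + 1, c - half:c + half + 1]
        errors.append(np.abs(t - w).mean())
    return float(np.mean(errors))
```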
https://arxiv.org/abs/2403.18443
The scarcity of ground-truth labels poses one major challenge in developing optical flow estimation models that are both generalizable and robust. While current methods rely on data augmentation, they have yet to fully exploit the rich information available in labeled video sequences. We propose OCAI, a method that supports robust frame interpolation by generating intermediate video frames alongside the optical flows in between. Utilizing a forward warping approach, OCAI employs occlusion awareness to resolve ambiguities in pixel values and fills in missing values by leveraging the forward-backward consistency of optical flows. Additionally, we introduce a teacher-student style semi-supervised learning method on top of the interpolated frames. Using a pair of unlabeled frames and the teacher model's predicted optical flow, we generate interpolated frames and flows to train a student model. The teacher's weights are maintained as an exponential moving average of the student's. Our evaluations demonstrate perceptually superior interpolation quality and enhanced optical flow accuracy on established benchmarks such as Sintel and KITTI.
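The teacher-student EMA update is a standard mechanism; a minimal sketch with illustrative parameter containers (dicts of named numpy arrays, an assumption for simplicity):

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Exponential-moving-average update of teacher weights from the student.
    Both arguments map parameter names to arrays; the teacher is updated in place
    as decay * teacher + (1 - decay) * student and returned for convenience."""
    for name, s in student_params.items():
        teacher_params[name] = decay * teacher_params[name] + (1.0 - decay) * s
    return teacher_params
```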
https://arxiv.org/abs/2403.18092
Accurate estimation of the velocities and trajectories of surrounding moving objects is a critical element of perception systems in Automated/Autonomous Vehicles (AVs), with a direct impact on their safety. These are non-trivial problems due to the diverse types and sizes of such objects and their dynamic and random behaviour. Recent point cloud based solutions often use Iterative Closest Point (ICP) techniques, which are known to have certain limitations. For example, their computational costs are high due to their iterative nature, and their estimation error often deteriorates as the relative velocities of the target objects increase (>2 m/sec). Motivated by such shortcomings, this paper first proposes a novel Detection and Tracking of Moving Objects (DATMO) approach for AVs based on an optical flow technique, which is computationally efficient and highly accurate for such problems. This is achieved by representing the driving scenario as a vector field and applying vector calculus theories to ensure spatiotemporal continuity. We also report the results of a comprehensive performance evaluation of the proposed DATMO technique, carried out in this study using synthetic and real-world data. The results of this study demonstrate the superiority of the proposed technique, compared to the DATMO techniques in the literature, in terms of estimation accuracy and processing time over a wide range of relative velocities of moving objects. Finally, we evaluate and discuss the sensitivity of the estimation error of the proposed DATMO technique to various system and environmental parameters, as well as the relative velocities of the moving objects.
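As a simplified illustration of how dense optical flow yields metric velocities (assuming a static, calibrated camera and known per-pixel depth, which is more restrictive than the paper's vector-field formulation and ignores motion along the optical axis):

```python
import numpy as np

def pixel_velocity(flow, depth, fx, fy, dt):
    """Convert image flow (H, W, 2, pixels/frame) into lateral metric velocity (m/s),
    given per-pixel depth (H, W, metres), focal lengths fx, fy (pixels) and frame time dt.
    Uses the pinhole relation dx_pixels ~= fx * dX / Z, so dX = dx * Z / fx."""
    vx = flow[..., 0] * depth / (fx * dt)
    vy = flow[..., 1] * depth / (fy * dt)
    return np.stack([vx, vy], axis=-1)
```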
https://arxiv.org/abs/2403.17779
The advancement of generation models has led to the emergence of highly realistic artificial intelligence (AI)-generated videos. Malicious users can easily create non-existent videos to spread false information. This letter proposes an effective AI-generated video detection (AIGVDet) scheme that captures forensic traces with a two-branch spatio-temporal convolutional neural network (CNN). Specifically, two ResNet sub-detectors are learned separately to identify anomalies in the spatial and optical-flow domains, respectively. The results of the two sub-detectors are fused to further enhance the discrimination ability. A large-scale generated video dataset (GVD) is constructed as a benchmark for model training and evaluation. Extensive experimental results verify the high generalization ability and robustness of our AIGVDet scheme. Code and dataset will be available at this https URL.
https://arxiv.org/abs/2403.16638
Applications of an efficient emotion recognition system can be found in several domains such as medicine, driver fatigue surveillance, social robotics, and human-computer interaction. Appraising human emotional states, behaviors, and reactions displayed in real-world settings can be accomplished using latent continuous dimensions. Continuous dimensional models of human affect, such as those based on valence and arousal are more accurate in describing a broad range of spontaneous everyday emotions than more traditional models of discrete stereotypical emotion categories (e.g. happiness, surprise). Most of the prior work on estimating valence and arousal considers laboratory settings and acted data. But, for emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in the world. Action recognition is a domain of Computer Vision that involves capturing complementary information on appearance from still frames and motion between frames. In this paper, we treat emotion recognition from the perspective of action recognition by exploring the application of deep learning architectures specifically designed for action recognition, for continuous affect recognition. We propose a novel three-stream end-to-end deep learning regression pipeline with an attention mechanism, which is an ensemble design based on sub-modules of multiple state-of-the-art action recognition systems. The pipeline constitutes a novel data pre-processing approach with a spatial self-attention mechanism to extract keyframes. The optical flow of high-attention regions of the face is extracted to capture temporal context. AFEW-VA in-the-wild dataset has been used to conduct comparative experiments. Quantitative analysis shows that the proposed model outperforms multiple standard baselines of both emotion recognition and action recognition models.
https://arxiv.org/abs/2403.16263
Implicit neural representations for video (NeRV) have recently become a novel way for high-quality video representation. However, existing works employ a single network to represent the entire video, which implicitly confuses static and dynamic information. This leads to an inability to effectively compress the redundant static information and a lack of explicit modeling of globally temporal-coherent dynamic details. To solve the above problems, we propose DS-NeRV, which decomposes videos into sparse learnable static codes and dynamic codes without the need for explicit optical flow or residual supervision. By setting different sampling rates for the two codes and applying weighted sum and interpolation sampling methods, DS-NeRV efficiently utilizes redundant static information while maintaining high-frequency details. Additionally, we design a cross-channel attention-based (CCA) fusion module to efficiently fuse these two codes for frame decoding. Our approach achieves a high-quality reconstruction of 31.2 PSNR with only 0.35M parameters thanks to the separate static and dynamic code representation, and it outperforms existing NeRV methods in many downstream tasks. Our project website is at this https URL.
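A toy view of the static/dynamic decomposition (assumed shapes and a simple linear interpolation; DS-NeRV's actual sampling and CCA fusion are learned modules): a small bank of static codes and a larger bank of dynamic codes are interpolated per frame and combined by a weighted sum.

```python
import numpy as np

def sample_code(codes, t, num_frames):
    """Linearly interpolate a sparse bank of codes (K, D) at frame index t of num_frames."""
    pos = t / max(num_frames - 1, 1) * (len(codes) - 1)
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    w = pos - lo
    return (1.0 - w) * codes[lo] + w * codes[hi]

def frame_latent(static_codes, dynamic_codes, t, num_frames, alpha=0.5):
    """Weighted sum of interpolated static and dynamic codes (both (K, D)) for frame t."""
    s = sample_code(static_codes, t, num_frames)    # few static codes -> shared scene content
    d = sample_code(dynamic_codes, t, num_frames)   # more dynamic codes -> per-frame detail
    return alpha * s + (1.0 - alpha) * d
```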
https://arxiv.org/abs/2403.15679
In minimally invasive endovascular procedures, contrast-enhanced angiography remains the most robust imaging technique. However, it comes at the expense of the patient's and clinician's health due to prolonged radiation exposure. As an alternative, interventional ultrasound has notable benefits such as being radiation-free, fast to deploy, and having a small footprint in the operating room. Yet, ultrasound is hard to interpret and highly prone to artifacts and noise. Additionally, interventional radiologists must undergo extensive training before they become qualified to diagnose and treat patients effectively, leading to a shortage of staff and a lack of open-source datasets. In this work, we seek to address both problems by introducing a self-supervised deep learning architecture to segment catheters in longitudinal ultrasound images, without demanding any labeled data. The network architecture builds upon AiAReSeg, a segmentation transformer built with the Attention in Attention mechanism, and is capable of learning feature changes across time and space. To facilitate training, we used synthetic ultrasound data based on physics-driven catheter insertion simulations, and translated the data into a unique CT-Ultrasound common domain, CACTUSS, to improve the segmentation performance. We generated ground truth segmentation masks by computing the optical flow between adjacent frames using FlowNet2 and performing thresholding to obtain a binary map estimate. Finally, we validated our model on a test dataset consisting of unseen synthetic data and images collected from silicon aorta phantoms, thus demonstrating its potential for applications to clinical data in the future.
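The pseudo-label step described above is easy to reproduce in spirit (a sketch; the dense flow would come from FlowNet2 in the paper, but any flow estimator works for illustration, and the threshold value here is an assumption):

```python
import numpy as np

def flow_to_mask(flow, threshold=1.0):
    """Binary pseudo segmentation mask from a dense flow field (H, W, 2):
    pixels whose flow magnitude exceeds `threshold` (pixels/frame) are marked as moving,
    i.e. as likely belonging to the advancing catheter."""
    magnitude = np.linalg.norm(flow, axis=-1)
    return (magnitude > threshold).astype(np.uint8)
```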
https://arxiv.org/abs/2403.14465
Visual simultaneous localization and mapping (VSLAM) has broad applications, with state-of-the-art methods leveraging deep neural networks for better robustness and applicability. However, there is a lack of research on fusing these learning-based methods with multi-sensor information, which could be indispensable for pushing related applications to large-scale and complex scenarios. In this paper, we tightly integrate the trainable deep dense bundle adjustment (DBA) with multi-sensor information through a factor graph. In the framework, recurrent optical flow and DBA are performed among sequential images. The Hessian information derived from DBA is fed into a generic factor graph for multi-sensor fusion, which employs a sliding window and supports probabilistic marginalization. A pipeline for visual-inertial integration is first developed, which provides the minimum capability of metric-scale localization and mapping. Furthermore, other sensors (e.g., a global navigation satellite system) are integrated for drift-free and geo-referencing functionality. Extensive tests are conducted on both public datasets and self-collected datasets. The results validate the superior localization performance of our approach, which enables real-time dense mapping in large-scale environments. The code has been made open-source (this https URL).
https://arxiv.org/abs/2403.13714
Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align video frames with desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. S2DM can generate a group of intrinsically related data sharing the same semantic and stochastic features while varying in temporal features under appropriate guiding conditions. We apply S2DM to video generation tasks and explore the use of optical flow as temporal conditions. Our experimental results show that S2DM outperforms many existing methods in the task of video generation without any temporal-feature modelling modules. For text-to-video generation tasks where temporal conditions are not explicitly given, we propose a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features. We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works. Our results can be viewed at this https URL.
https://arxiv.org/abs/2403.13408