Vision Based Navigation consists of using cameras as precision sensors for GNC after extracting information from images. One of the obstacles to the adoption of machine learning for space applications is demonstrating that available training datasets are adequate to validate the algorithms. The objective of the study is to generate datasets of images and metadata suitable for training machine learning algorithms. Two use cases were selected and a robust methodology was developed to validate the datasets including the ground truth. The first use case is in-orbit rendezvous with a man-made object: a mockup of the satellite ENVISAT. The second use case is a Lunar landing scenario. Datasets were produced from archival data (Chang'e 3), in the laboratory at the DLR TRON facility and at the Airbus Robotic laboratory, with the SurRender software high-fidelity image simulator using Model Capture, and with Generative Adversarial Networks. The use case definition included the selection of benchmark algorithms: an AI-based pose estimation algorithm and a dense optical flow algorithm were selected. Eventually it is demonstrated that datasets produced with SurRender and the selected laboratory facilities are adequate to train machine learning algorithms.
Vision-based navigation uses cameras as precision sensors for guidance, navigation and control (GNC) by extracting information from images. One obstacle to adopting machine learning for space applications is demonstrating that the available training datasets are adequate to validate the algorithms. The objective of this study is to generate datasets of images and metadata suitable for training machine learning algorithms. Two use cases were selected, and a robust methodology, including ground truth, was developed to validate the datasets. The first use case is in-orbit rendezvous with a man-made object: a mockup of the ENVISAT satellite. The second is a lunar landing scenario. Datasets were produced from archival data (Chang'e 3), in the laboratory (the DLR TRON facility and the Airbus Robotic laboratory), with the SurRender high-fidelity image simulator using Model Capture, and with Generative Adversarial Networks. The use-case definition included selecting benchmark algorithms: an AI-based pose estimation algorithm and a dense optical flow algorithm. Finally, it is demonstrated that datasets produced with SurRender and the selected laboratory facilities are adequate for training machine learning algorithms.
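The study uses a dense optical flow algorithm as one of its two benchmark algorithms. As a rough illustration of how such a baseline can be scored against dataset ground truth, here is a minimal sketch using OpenCV's Farnebäck method; the file paths and the ground-truth array are placeholders, not artifacts of the study.

```python
# Minimal sketch: score a classical dense optical flow baseline against the
# dataset ground truth. File paths and the ground-truth array are placeholders.
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical rendered frame
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense Farneback optical flow: one (dx, dy) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 4, 21, 3, 5, 1.1, 0)

# Average end-point error against the ground-truth flow shipped with the dataset.
gt_flow = np.load("frame_000_gt_flow.npy")                 # (H, W, 2), placeholder
epe = np.linalg.norm(flow - gt_flow, axis=-1).mean()
print(f"mean EPE: {epe:.3f} px")
```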
https://arxiv.org/abs/2409.11383
Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. The proposed system uses three feature streams: RGB video, optical flow, and audio signals, where each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employ an attention-based, multi-stage feature enhancement approach to improve spatial and temporal features from the RGB video, where the first stage consists of a ViT-based CLIP module, with top-k features concatenated in parallel with rich spatiotemporal features from I3D and Temporal Contextual Aggregation (TCA). The second stage effectively captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously, and the third stage selects the most relevant spatiotemporal features. The second stream extracts enhanced attention-based spatiotemporal features from the optical flow modality by integrating deep learning with an attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies based on sound patterns. These streams enrich the model by incorporating motion and audio signals often indicative of abnormal events undetectable through visual analysis alone. The multimodal fusion concatenates these streams and leverages the strengths of each modality, resulting in a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. Extensive experiments and high performance on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.
Weakly supervised video anomaly detection (WS-VAD) is a crucial area of computer vision for building intelligent surveillance systems. The proposed system uses three feature streams, RGB video, optical flow, and audio, each extracting complementary spatial and temporal features through an enhanced attention module to improve detection accuracy and robustness. In the first stream, an attention-based multi-stage feature enhancement approach improves the spatial and temporal features of the RGB video: the first stage consists of a ViT-based CLIP module whose top-k features are concatenated in parallel with rich spatiotemporal features from I3D and Temporal Contextual Aggregation (TCA); the second stage captures temporal dependencies with the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously; and the third stage selects the most relevant spatiotemporal features. The second stream extracts enhanced attention-based spatiotemporal features from the optical-flow modality by combining deep learning with an attention module. The audio stream captures auditory cues with an attention module integrated with the VGGish model, aiming to detect anomalies from sound patterns. These streams enrich the model with motion and audio signals that often indicate abnormal events undetectable by visual analysis alone. Concatenating the multimodal features leverages the strengths of each modality, yielding a comprehensive feature set that significantly improves anomaly detection accuracy and robustness on three datasets. Extensive experiments and strong results on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.
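Two of the ideas mentioned above, top-k selection of ViT/CLIP features and late concatenation of the RGB, flow and audio streams, can be sketched as follows. This is an illustrative PyTorch sketch only; all dimensions and module names are assumptions, not the paper's implementation.

```python
# Hedged sketch: top-k selection of ViT/CLIP patch tokens plus late fusion of
# RGB, flow and audio embeddings. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

def topk_tokens(tokens: torch.Tensor, scores: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k highest-scoring tokens per clip. tokens: (B, N, D), scores: (B, N)."""
    idx = scores.topk(k, dim=1).indices                      # (B, k)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

class LateFusionHead(nn.Module):
    """Concatenate per-stream embeddings and predict a segment anomaly score."""
    def __init__(self, d_rgb=512, d_flow=256, d_audio=128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_rgb + d_flow + d_audio, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, f_rgb, f_flow, f_audio):
        return self.classifier(torch.cat([f_rgb, f_flow, f_audio], dim=-1))

# toy usage
rgb_tokens = torch.randn(2, 196, 512)          # e.g. CLIP patch tokens
scores = rgb_tokens.norm(dim=-1)               # stand-in for attention scores
kept = topk_tokens(rgb_tokens, scores, k=16)   # (2, 16, 512)
head = LateFusionHead()
out = head(kept.mean(1), torch.randn(2, 256), torch.randn(2, 128))  # (2, 1)
```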
https://arxiv.org/abs/2409.11223
Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named motion-conjugated feature representations. Differently from existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstractions, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity. We assess our model on photorealistic synthetic streams and real-world videos, comparing to pre-trained state-of-the-art feature extractors (also based on Transformers) and to recent unsupervised learning models, significantly outperforming these alternatives.
Learning with neural networks from a continuous stream of visual information poses several challenges because the data are non-i.i.d., but it also offers new opportunities to develop representations consistent with the information flow. This paper investigates unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, hence called motion-conjugated feature representations. Unlike existing approaches, motion is not a given signal (neither ground truth nor the output of an external module) but the outcome of a progressive and autonomous learning process occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstraction, from traditional optical flow to latent signals originating from higher-level features, hence called higher-order motions. Continuously learning consistent multi-order flows and representations is prone to trivial solutions, which are counteracted with a spatially-aware self-supervised contrastive loss based on flow-induced similarity. The model is evaluated on photorealistic synthetic streams and real-world videos, and significantly outperforms pre-trained state-of-the-art feature extractors (including Transformer-based ones) and recent unsupervised learning models.
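A spatially-aware contrastive loss based on flow-induced similarity can be sketched as an InfoNCE objective in which features linked by the estimated flow form positive pairs. The sketch below is an assumption-laden simplification, not the paper's exact loss.

```python
# Hedged sketch (not the paper's exact loss): a feature at (x, y) in frame t
# and the feature at the flow-displaced location in frame t+1 form a positive
# pair; other sampled locations act as negatives.
import torch
import torch.nn.functional as F

def flow_contrastive_loss(feat_t, feat_t1, flow, n_samples=256, tau=0.1):
    """feat_*: (C, H, W) per-pixel features; flow: (2, H, W) displacement t -> t+1."""
    C, H, W = feat_t.shape
    ys = torch.randint(0, H, (n_samples,))
    xs = torch.randint(0, W, (n_samples,))
    # Flow-displaced (rounded) target coordinates, clamped to the image.
    xt = (xs + flow[0, ys, xs]).round().long().clamp(0, W - 1)
    yt = (ys + flow[1, ys, xs]).round().long().clamp(0, H - 1)
    anchors   = F.normalize(feat_t[:, ys, xs].t(), dim=1)    # (n, C)
    positives = F.normalize(feat_t1[:, yt, xt].t(), dim=1)   # (n, C)
    logits = anchors @ positives.t() / tau                   # (n, n)
    labels = torch.arange(n_samples)                         # diagonal = positives
    return F.cross_entropy(logits, labels)

loss = flow_contrastive_loss(torch.randn(16, 64, 64), torch.randn(16, 64, 64),
                             torch.randn(2, 64, 64))
```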
https://arxiv.org/abs/2409.11441
The ability of neural networks to perform robotic perception and control tasks such as depth and optical flow estimation, simultaneous localization and mapping (SLAM), and automatic control has led to their widespread adoption in recent years. Deep Reinforcement Learning has been used extensively in these settings, as it does not have the unsustainable training costs associated with supervised learning. However, DeepRL suffers from poor sample efficiency, i.e., it requires a large number of environmental interactions to converge to an acceptable solution. Modern RL algorithms such as Deep Q Learning and Soft Actor-Critic attempt to remedy this shortcoming but can not provide the explainability required in applications such as autonomous robotics. Humans intuitively understand the long-time-horizon sequential tasks common in robotics. Properly using such intuition can make RL policies more explainable while enhancing their sample efficiency. In this work, we propose SHIRE, a novel framework for encoding human intuition using Probabilistic Graphical Models (PGMs) and using it in the Deep RL training pipeline to enhance sample efficiency. Our framework achieves 25-78% sample efficiency gains across the environments we evaluate at negligible overhead cost. Additionally, by teaching RL agents the encoded elementary behavior, SHIRE enhances policy explainability. A real-world demonstration further highlights the efficacy of policies trained using our framework.
The ability of neural networks to perform robotic perception and control tasks such as depth and optical flow estimation, simultaneous localization and mapping (SLAM), and automatic control has led to their widespread adoption in recent years. Deep reinforcement learning is used extensively in these settings because it avoids the unsustainable training costs of supervised learning. However, deep RL suffers from poor sample efficiency: it needs a large number of environment interactions to converge to an acceptable solution. Modern RL algorithms such as Deep Q-Learning and Soft Actor-Critic try to remedy this shortcoming but cannot provide the explainability required in applications such as autonomous robotics. Humans intuitively understand the long-horizon sequential tasks common in robotics, and properly using this intuition can make RL policies more explainable while improving their sample efficiency. This paper proposes SHIRE, a novel framework that encodes human intuition with probabilistic graphical models (PGMs) and uses it in the deep RL training pipeline to improve sample efficiency. The framework achieves 25-78% sample-efficiency gains across the evaluated environments at negligible overhead. In addition, by teaching RL agents the encoded elementary behavior, SHIRE improves policy explainability. A real-world demonstration further highlights the efficacy of policies trained with the framework.
https://arxiv.org/abs/2409.09990
Cloth manipulation is an important aspect of many everyday tasks and remains a significant challenge for robots. While existing research has made strides in tasks like cloth smoothing and folding, many studies struggle with common failure modes (crumpled corners/edges, incorrect grasp configurations) that a preliminary step of cloth layer detection can solve. We present a novel method for classifying the number of grasped cloth layers using a custom gripper equipped with DenseTact 2.0 optical tactile sensors. After grasping a cloth, the gripper performs an anthropomorphic rubbing motion while collecting optical flow, 6-axis wrench, and joint state data. Using this data in a transformer-based network achieves a test accuracy of 98.21% in correctly classifying the number of grasped layers, showing the effectiveness of our dynamic rubbing method. Evaluating different inputs and model architectures highlights the usefulness of using tactile sensor information and a transformer model for this task. A comprehensive dataset of 368 labeled trials was collected and made open-source along with this paper. Our project page is available at this https URL.
Cloth manipulation is an important part of many everyday tasks and remains a significant challenge for robots. Although existing research has made progress on tasks such as cloth smoothing and folding, many studies struggle with common failure modes (crumpled corners or edges, incorrect grasp configurations) that a preliminary cloth-layer-detection step can solve. This work presents a novel method for classifying the number of grasped cloth layers using a custom gripper equipped with DenseTact 2.0 optical tactile sensors. After grasping the cloth, the gripper performs an anthropomorphic rubbing motion while collecting optical flow, 6-axis wrench, and joint-state data. Feeding these data to a transformer-based network achieves a test accuracy of 98.21% in classifying the number of grasped layers, demonstrating the effectiveness of the dynamic rubbing approach. Evaluating different inputs and model architectures highlights the usefulness of tactile sensing and a transformer model for this task. A comprehensive dataset of 368 labeled trials was collected and released as open source with the paper; the project page link is provided.
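A transformer classifier over a rubbing sequence, with each timestep token concatenating optical-flow statistics, the 6-axis wrench and joint states, might look like the sketch below. All sizes and the number of output classes are illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch: a small transformer that classifies the number of grasped
# cloth layers from a rubbing sequence. Sizes and class count are assumptions.
import torch
import torch.nn as nn

class LayerClassifier(nn.Module):
    def __init__(self, d_flow=32, d_wrench=6, d_joint=4, d_model=64, n_classes=4):
        super().__init__()
        self.embed = nn.Linear(d_flow + d_wrench + d_joint, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, flow_feat, wrench, joints):
        # all inputs: (batch, time, dim); one token per timestep
        tokens = self.embed(torch.cat([flow_feat, wrench, joints], dim=-1))
        encoded = self.encoder(tokens)          # (batch, time, d_model)
        return self.head(encoded.mean(dim=1))   # temporal average pooling

model = LayerClassifier()
logits = model(torch.randn(8, 50, 32), torch.randn(8, 50, 6), torch.randn(8, 50, 4))
print(logits.shape)  # torch.Size([8, 4])
```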
https://arxiv.org/abs/2409.09849
In recent years, workplaces and educational institutes have widely adopted virtual meeting platforms. This has led to a growing interest in analyzing and extracting insights from these meetings, which requires effective detection and tracking of unique individuals. In practice, there is no standardization in video meeting recording layouts, and how they are captured across the different platforms and services. This, in turn, creates a challenge in acquiring this data stream and analyzing it in a uniform fashion. Our approach provides a solution to the most general form of video recording, usually consisting of a grid of participants from a single video source with no metadata on participant locations, while using the least amount of constraints and assumptions as to how the data was acquired. Conventional approaches often use YOLO models coupled with tracking algorithms, assuming linear motion trajectories akin to that observed in CCTV footage. However, such assumptions fall short in virtual meetings, where participant video feed windows can abruptly change location across the grid. In an organic video meeting setting, participants frequently join and leave, leading to sudden, non-linear movements on the video grid. This disrupts optical flow-based tracking methods that depend on linear motion. Consequently, standard object detection and tracking methods might mistakenly assign multiple participants to the same tracker. In this paper, we introduce a novel approach to track and re-identify participants in remote video meetings, by utilizing the spatio-temporal priors arising from the data in our domain. This, in turn, increases tracking capabilities compared to the use of general object tracking. Our approach reduces the error rate by 95% on average compared to YOLO-based tracking methods as a baseline.
In recent years, workplaces and educational institutions have widely adopted virtual meeting platforms, which has led to growing interest in analyzing and extracting insights from these meetings; this requires effective detection and tracking of unique individuals. In practice there is no standardization in how video-meeting recordings are laid out or captured across platforms and services, which makes it challenging to acquire this data stream and analyze it uniformly. The proposed approach handles the most general form of recording, usually a grid of participants from a single video source with no metadata on participant locations, while making the fewest possible constraints and assumptions about how the data were acquired. Conventional approaches often couple YOLO models with tracking algorithms that assume linear motion trajectories, as observed in CCTV footage. Such assumptions fail in virtual meetings, where a participant's video tile can abruptly change location on the grid as participants join and leave, disrupting optical-flow-based tracking methods that depend on linear motion. As a result, standard detection and tracking methods may assign multiple participants to the same tracker. This paper introduces a novel approach for tracking and re-identifying participants in remote video meetings by exploiting the spatio-temporal priors arising from data in this domain, which increases tracking capability compared with general object tracking. The approach reduces the error rate by 95% on average compared with YOLO-based tracking baselines.
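The kind of spatio-temporal prior described, participant tiles living in a fixed grid, can be illustrated with a simple cell-snapping rule; grid dimensions and the association logic below are assumptions, not the paper's method.

```python
# Hedged sketch of a grid prior for video-meeting layouts: a detection is
# snapped to the grid cell containing its centre and identities are maintained
# per cell rather than by assuming smooth, CCTV-like motion.
def cell_index(cx: float, cy: float, rows: int = 3, cols: int = 3) -> int:
    """Map a normalised bbox centre (cx, cy) in [0, 1]^2 to a grid-cell index."""
    r = min(int(cy * rows), rows - 1)
    c = min(int(cx * cols), cols - 1)
    return r * cols + c

# Per-frame association: keep one track per occupied cell; a participant whose
# tile jumps to a new cell is re-identified there (appearance re-ID would refine this).
tracks = {}  # cell index -> participant id
for pid, (cx, cy) in enumerate([(0.15, 0.2), (0.5, 0.2), (0.85, 0.7)]):
    tracks[cell_index(cx, cy)] = f"participant_{pid}"
print(tracks)  # {0: 'participant_0', 1: 'participant_1', 8: 'participant_2'}
```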
https://arxiv.org/abs/2409.09841
Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.
Drag-based image editing has recently become popular for its interactivity and precision. However, although text-to-image models can generate samples within a second, drag editing still lags behind because of the challenge of accurately reflecting user interaction while preserving image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods that require additional inputs such as masks for movable regions and text prompts, compromising the interactivity of editing. InstantDrag is an optimization-free pipeline that improves interactivity and speed, requiring only an image and a drag instruction as input. It consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical-flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing from real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. Experiments on facial video datasets and general scenes demonstrate InstantDrag's ability to perform fast, photo-realistic edits without masks or text prompts. These results highlight the efficiency of the approach for drag-based image editing, making it a promising solution for interactive, real-time applications.
https://arxiv.org/abs/2409.08857
Violence and abnormal behavior detection research has seen an increase of interest in recent years, due mainly to a rise in crimes in large cities worldwide. In this work, we propose a deep learning architecture for violence detection which combines both recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNN). In addition to video frames, we use optical flow computed from the captured sequences. The CNN extracts spatial characteristics in each frame, while the RNN extracts temporal characteristics. The use of optical flow allows the movements in the scenes to be encoded. The proposed approaches reach the same level as state-of-the-art techniques and sometimes surpass them. The method was validated on 3 databases, achieving good results.
Violence and abnormal-behavior detection research has attracted increasing interest in recent years, mainly because of rising crime in large cities worldwide. This work proposes a deep learning architecture for violence detection that combines recurrent neural networks (RNNs) and 2D convolutional neural networks (2D CNNs). In addition to video frames, optical flow computed from the captured sequences is used. The CNN extracts spatial characteristics in each frame, while the RNN extracts temporal characteristics, and the optical flow encodes the movements in the scene. The proposed approaches reach the level of state-of-the-art techniques and sometimes surpass them, and were validated on three databases with good results.
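The described combination, a 2D CNN for per-frame spatial features over RGB and optical flow followed by an RNN for temporal aggregation, can be sketched as follows; layer sizes are illustrative, not the paper's configuration.

```python
# Hedged sketch: a 2D CNN extracts per-frame features from stacked RGB + flow
# channels, and an LSTM aggregates them over time for a binary decision.
import torch
import torch.nn as nn

class CnnRnnViolenceDetector(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                 # input: 3 RGB + 2 flow channels
            nn.Conv2d(5, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, clips):                     # clips: (B, T, 5, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)   # (B, T, 64)
        _, (h_n, _) = self.rnn(feats)
        return torch.sigmoid(self.head(h_n[-1]))  # (B, 1) violence probability

model = CnnRnnViolenceDetector()
prob = model(torch.randn(2, 16, 5, 112, 112))
```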
https://arxiv.org/abs/2409.07581
The localization of Unmanned aerial vehicles (UAVs) in deep tunnels is extremely challenging due to their inaccessibility and hazardous environment. Conventional outdoor localization techniques (such as using GPS) and indoor localization techniques (such as those based on WiFi, Infrared (IR), Ultra-Wideband, etc.) do not work in deep tunnels. We are developing a UAV-based system for the inspection of defects in the Deep Tunnel Sewerage System (DTSS) in Singapore. To enable the UAV localization in the DTSS, we have developed a distance measurement module based on the optical flow technique. However, the standard optical flow technique does not work well in tunnels with poor lighting and a lack of features. Thus, we have developed an enhanced optical flow algorithm with prediction, to improve the distance measurement for UAVs in deep hazardous tunnels.
Localizing unmanned aerial vehicles (UAVs) in deep tunnels is extremely challenging because of their inaccessibility and hazardous environment. Conventional outdoor localization techniques (such as GPS) and indoor techniques (such as those based on WiFi, infrared, or ultra-wideband) do not work in deep tunnels. A UAV-based system is being developed for inspecting defects in the Deep Tunnel Sewerage System (DTSS) in Singapore. To enable UAV localization in the DTSS, a distance-measurement module based on optical flow was developed. However, the standard optical flow technique does not work well in tunnels with poor lighting and few features, so an enhanced optical flow algorithm with prediction was developed to improve distance measurement for UAVs in deep, hazardous tunnels.
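One plausible way to combine an optical-flow displacement estimate with a prediction fallback for feature-poor frames is sketched below; it is a generic illustration, not the paper's enhanced algorithm.

```python
# Hedged sketch (not the paper's exact algorithm): estimate frame-to-frame
# displacement from sparse optical flow, and fall back to a constant-velocity
# prediction when too few features are trackable (dark, featureless tunnel).
import cv2
import numpy as np

def flow_displacement(prev_gray, curr_gray, last_velocity, min_points=20):
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    if pts is not None and len(pts) >= min_points:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
        good = status.ravel() == 1
        if good.sum() >= min_points:
            motion = (nxt[good] - pts[good]).reshape(-1, 2)
            return np.median(motion, axis=0)          # robust pixel displacement
    return last_velocity                              # prediction: reuse last motion

# Usage idea: convert the pixel displacement to metres using the camera
# intrinsics and flying height (the scale factor is scenario dependent).
```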
https://arxiv.org/abs/2409.07160
Achieving 3D understanding of non-Lambertian objects is an important task with many useful applications, but most existing algorithms struggle to deal with such objects. One major obstacle towards progress in this field is the lack of holistic non-Lambertian benchmarks -- most benchmarks have low scene and object diversity, and none provide multi-layer 3D annotations for objects occluded by transparent surfaces. In this paper, we introduce LayeredFlow, a real world benchmark containing multi-layer ground truth annotation for optical flow of non-Lambertian objects. Compared to previous benchmarks, our benchmark exhibits greater scene and object diversity, with 150k high quality optical flow and stereo pairs taken over 185 indoor and outdoor scenes and 360 unique objects. Using LayeredFlow as evaluation data, we propose a new task called multi-layer optical flow. To provide training data for this task, we introduce a large-scale densely-annotated synthetic dataset containing 60k images within 30 scenes tailored for non-Lambertian objects. Training on our synthetic dataset enables the model to predict multi-layer optical flow, while fine-tuning existing optical flow methods on the dataset notably boosts their performance on non-Lambertian objects without compromising the performance on diffuse objects. Data is available at this https URL.
Achieving 3D understanding of non-Lambertian objects is an important task with many useful applications, but most existing algorithms struggle with such objects. A major obstacle to progress is the lack of holistic non-Lambertian benchmarks: most benchmarks have low scene and object diversity, and none provide multi-layer 3D annotations for objects occluded by transparent surfaces. This paper introduces LayeredFlow, a real-world benchmark containing multi-layer ground-truth annotations for the optical flow of non-Lambertian objects. Compared with previous benchmarks, it offers greater scene and object diversity, with 150k high-quality optical flow and stereo pairs taken over 185 indoor and outdoor scenes and 360 unique objects. Using LayeredFlow as evaluation data, a new task called multi-layer optical flow is proposed. To provide training data for this task, a large-scale, densely annotated synthetic dataset of 60k images across 30 scenes tailored to non-Lambertian objects is introduced. Training on the synthetic dataset enables models to predict multi-layer optical flow, while fine-tuning existing optical flow methods on it notably boosts their performance on non-Lambertian objects without compromising performance on diffuse objects. The data are available at the linked URL.
https://arxiv.org/abs/2409.05688
With advancements in computer vision and deep learning, video-based human action recognition (HAR) has become practical. However, due to the complexity of the computation pipeline, running HAR on live video streams incurs excessive delays on embedded platforms. This work tackles the real-time performance challenges of HAR with four contributions: 1) an experimental study identifying a standard Optical Flow (OF) extraction technique as the latency bottleneck in a state-of-the-art HAR pipeline, 2) an exploration of the latency-accuracy tradeoff between the standard and deep learning approaches to OF extraction, which highlights the need for a novel, efficient motion feature extractor, 3) the design of Integrated Motion Feature Extractor (IMFE), a novel single-shot neural network architecture for motion feature extraction with drastic improvement in latency, 4) the development of RT-HARE, a real-time HAR system tailored for embedded platforms. Experimental results on an Nvidia Jetson Xavier NX platform demonstrated that RT-HARE realizes real-time HAR at a video frame rate of 30 frames per second while delivering high levels of recognition accuracy.
With advances in computer vision and deep learning, video-based human action recognition (HAR) has become practical. However, because of the complexity of the computation pipeline, running HAR on live video streams incurs excessive delays on embedded platforms. This work tackles the real-time performance challenges of HAR with four contributions: 1) an experimental study identifying a standard optical flow (OF) extraction technique as the latency bottleneck in a state-of-the-art HAR pipeline; 2) an exploration of the latency-accuracy trade-off between standard and deep learning approaches to OF extraction, highlighting the need for a novel, efficient motion feature extractor; 3) the design of the Integrated Motion Feature Extractor (IMFE), a novel single-shot neural network architecture for motion feature extraction with drastically improved latency; and 4) the development of RT-HARE, a real-time HAR system tailored for embedded platforms. Experimental results on an Nvidia Jetson Xavier NX platform show that RT-HARE achieves real-time HAR at 30 frames per second while delivering high recognition accuracy.
https://arxiv.org/abs/2409.05662
Optical flow estimation is a fundamental and long-standing visual task. In this work, we present a novel method, dubbed HMAFlow, to improve optical flow estimation in challenging scenes, especially with small objects. The proposed model mainly consists of two core components: a Hierarchical Motion Field Alignment (HMA) module and a Correlation Self-Attention (CSA) module. In addition, we rebuild 4D cost volumes by employing a Multi-Scale Correlation Search (MCS) layer and replacing average pooling in common cost volumes with a search strategy using multiple search ranges. Experimental results demonstrate that our model achieves the best generalization performance in comparison to other state-of-the-art methods. Specifically, compared with RAFT, our method achieves relative error reductions of 14.2% and 3.4% on the clean pass and final pass of the Sintel online benchmark, respectively. On the KITTI test benchmark, HMAFlow surpasses RAFT and GMA in the Fl-all metric by a relative margin of 6.8% and 7.7%, respectively. To facilitate future research, our code will be made available at this https URL.
Optical flow estimation is a fundamental and long-standing visual task. This work presents a novel method, HMAFlow, to improve optical flow estimation in challenging scenes, especially those with small objects. The model consists of two core components: a Hierarchical Motion Field Alignment (HMA) module and a Correlation Self-Attention (CSA) module. In addition, the 4D cost volumes are rebuilt with a Multi-Scale Correlation Search (MCS) layer, replacing the average pooling in common cost volumes with a search strategy over multiple search ranges. Experiments show the model achieves the best generalization performance compared with other state-of-the-art methods. Specifically, relative to RAFT, the method reduces error by 14.2% and 3.4% on the clean and final passes of the Sintel online benchmark, respectively. On the KITTI test benchmark, HMAFlow surpasses RAFT and GMA in the Fl-all metric by relative margins of 6.8% and 7.7%, respectively. The code will be released at the linked URL.
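The multi-scale correlation search idea, keeping the strongest match within several local search ranges instead of average-pooling the cost volume, can be sketched as follows; this is a simplified illustration, not HMAFlow's actual implementation.

```python
# Hedged sketch: build an all-pairs correlation volume and, instead of average
# pooling it into a pyramid, keep the strongest match within several local
# search ranges. Simplified illustration only.
import torch
import torch.nn.functional as F

def correlation_volume(f1, f2):
    """f1, f2: (B, C, H, W) feature maps -> (B, H*W, H, W) dot-product correlations."""
    b, c, h, w = f1.shape
    corr = torch.einsum("bchw,bcuv->bhwuv", f1, f2) / c ** 0.5
    return corr.reshape(b, h * w, h, w)

def multi_range_search(corr, radii=(1, 2, 4)):
    """For each source pixel, keep the best correlation within each search radius
    over the target plane, rather than a single average-pooled response."""
    out = [F.max_pool2d(corr, kernel_size=2 * r + 1, stride=1, padding=r) for r in radii]
    return torch.cat(out, dim=1)              # (B, len(radii) * H * W, H, W)

corr = correlation_volume(torch.randn(1, 32, 24, 32), torch.randn(1, 32, 24, 32))
searched = multi_range_search(corr)
```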
https://arxiv.org/abs/2409.05531
Facial movements play a crucial role in conveying attitude and intentions, and facial optical flow provides a dynamic and detailed representation of it. However, the scarcity of datasets and of a modern baseline hinders progress in facial optical flow research. This paper proposes FacialFlowNet (FFN), a novel large-scale facial optical flow dataset, and the Decomposed Facial Flow Model (DecFlow), the first method capable of decomposing facial flow. FFN comprises 9,635 identities and 105,970 image pairs, offering unprecedented diversity for detailed facial and head motion analysis. DecFlow features a facial semantic-aware encoder and a decomposed flow decoder, excelling in accurately estimating and decomposing facial flow into head and expression components. Comprehensive experiments demonstrate that FFN significantly enhances the accuracy of facial flow estimation across various optical flow methods, achieving up to an 11% reduction in Endpoint Error (EPE) (from 3.91 to 3.48). Moreover, DecFlow, when coupled with FFN, outperforms existing methods in both synthetic and real-world scenarios, enhancing facial expression analysis. The decomposed expression flow achieves a substantial accuracy improvement of 18% (from 69.1% to 82.1%) in micro-expression recognition. These contributions represent a significant advancement in facial motion analysis and optical flow estimation. Code and datasets are available.
Facial movements play a crucial role in conveying attitude and intentions, and facial optical flow provides a dynamic and detailed representation of them. However, the scarcity of datasets and of a modern baseline hinders progress in facial optical flow research. This paper proposes FacialFlowNet (FFN), a novel large-scale facial optical flow dataset, and the Decomposed Facial Flow Model (DecFlow), the first method capable of decomposing facial flow. FFN comprises 9,635 identities and 105,970 image pairs, offering unprecedented diversity for detailed facial and head motion analysis. DecFlow features a facial semantic-aware encoder and a decomposed flow decoder, and excels at accurately estimating facial flow and decomposing it into head and expression components. Comprehensive experiments show that FFN significantly improves the accuracy of facial flow estimation across various optical flow methods, achieving up to an 11% reduction in endpoint error (EPE, from 3.91 to 3.48). Moreover, DecFlow coupled with FFN outperforms existing methods in both synthetic and real-world scenarios, enhancing facial expression analysis; the decomposed expression flow yields a substantial 18% accuracy improvement in micro-expression recognition (from 69.1% to 82.1%). These contributions represent a significant advance in facial motion analysis and optical flow estimation. Code and datasets are available.
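The decomposition of total facial flow into head and expression components can be illustrated with the residual formulation below; the array paths are placeholders and the rigid head-flow source is assumed to be given.

```python
# Hedged sketch of the decomposition idea: if a model (or a rigid head-motion
# fit) provides the head-induced flow, the expression flow is the residual of
# the total facial flow. Arrays are placeholders with dataset-like shapes.
import numpy as np

total_flow = np.load("face_pair_total_flow.npy")   # (H, W, 2), placeholder path
head_flow  = np.load("face_pair_head_flow.npy")    # (H, W, 2), placeholder path

expression_flow = total_flow - head_flow            # non-rigid facial motion

# End-point error against ground truth, as used to compare flow estimators.
gt = np.load("face_pair_gt_flow.npy")               # placeholder path
epe = np.linalg.norm(total_flow - gt, axis=-1).mean()
print(f"EPE: {epe:.2f} px, mean expression magnitude: "
      f"{np.linalg.norm(expression_flow, axis=-1).mean():.2f} px")
```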
https://arxiv.org/abs/2409.05396
Current state-of-the-art flow methods are mostly based on dense all-pairs cost volumes. However, as image resolution increases, the computational and spatial complexity of constructing these cost volumes grows at a quartic rate, making these methods impractical for high-resolution images. In this paper, we propose a novel Hybrid Cost Volume for memory-efficient optical flow, named HCV. To construct HCV, we first propose a Top-k strategy to separate the 4D cost volume into two global 3D cost volumes. These volumes significantly reduce memory usage while retaining a substantial amount of matching information. We further introduce a local 4D cost volume with a local search space to supplement the local information for HCV. Based on HCV, we design a memory-efficient optical flow network, named HCVFlow. Compared to the recurrent flow methods based on all-pairs cost volumes, our HCVFlow significantly reduces memory consumption while ensuring high accuracy. We validate the effectiveness and efficiency of our method on the Sintel and KITTI datasets and real-world 4K (2160*3840) resolution images. Extensive experiments show that our HCVFlow has very low memory usage and outperforms other memory-efficient methods in terms of accuracy. The code is publicly available at this https URL.
Current state-of-the-art flow methods are mostly based on dense all-pairs cost volumes. However, as image resolution increases, the computational and spatial complexity of constructing these cost volumes grows at a quartic rate, making such methods impractical for high-resolution images. This paper proposes a novel Hybrid Cost Volume (HCV) for memory-efficient optical flow. To construct HCV, a Top-k strategy first separates the 4D cost volume into two global 3D cost volumes, which greatly reduce memory usage while retaining a substantial amount of matching information. A local 4D cost volume with a local search space is further introduced to supplement local information. Based on HCV, a memory-efficient optical flow network named HCVFlow is designed. Compared with recurrent flow methods based on all-pairs cost volumes, HCVFlow significantly reduces memory consumption while ensuring high accuracy. The method is validated on the Sintel and KITTI datasets and on real-world 4K (2160*3840) resolution images; extensive experiments show that HCVFlow has very low memory usage and outperforms other memory-efficient methods in terms of accuracy. The code is publicly available at the linked URL.
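The Top-k separation of a 4D all-pairs volume into two leaner global volumes can be sketched as follows; the exact construction in HCV may differ, so treat this as an assumption-laden illustration of the memory saving.

```python
# Hedged sketch of the top-k separation idea: collapse the 4D all-pairs volume
# into two global volumes, one over the horizontal target axis and one over
# the vertical, keeping only the k strongest responses along the collapsed
# axis. Simplified illustration, not HCV's exact construction.
import torch

def split_cost_volume(corr4d, k=4):
    """corr4d: (B, H, W, H2, W2) all-pairs correlations for each source pixel."""
    b, h, w, h2, w2 = corr4d.shape
    # Horizontal volume: for every target column, keep the k best rows.
    horiz = corr4d.topk(k, dim=3).values            # (B, H, W, k, W2)
    # Vertical volume: for every target row, keep the k best columns.
    vert = corr4d.topk(k, dim=4).values             # (B, H, W, H2, k)
    return horiz.flatten(0, 2), vert.flatten(0, 2)  # per-pixel 3D-like volumes

corr4d = torch.randn(1, 16, 24, 16, 24)
horiz, vert = split_cost_volume(corr4d)
print(horiz.shape, vert.shape)  # (384, 4, 24) and (384, 16, 4)
```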
https://arxiv.org/abs/2409.04243
Event cameras generate asynchronous and sparse event streams capturing changes in light intensity. They offer significant advantages over conventional frame-based cameras, such as a higher dynamic range and a much faster data rate, making them particularly useful in scenarios involving fast motion or challenging lighting conditions. Spiking neural networks (SNNs) share similar asynchronous and sparse characteristics and are well-suited for processing data from event cameras. Inspired by the potential of transformers and spike-driven transformers (spikeformers) in other computer vision tasks, we propose two solutions for fast and robust optical flow estimation for event cameras: STTFlowNet and SDformerFlow. STTFlowNet adopts a U-shaped artificial neural network (ANN) architecture with spatiotemporal shifted window self-attention (swin) transformer encoders, while SDformerFlow presents its fully spiking counterpart, incorporating swin spikeformer encoders. Furthermore, we present two variants of the spiking version with different neuron models. Our work is the first to make use of spikeformers for dense optical flow estimation. We conduct end-to-end training for all models using supervised learning. Our results yield state-of-the-art performance among SNN-based event optical flow methods on both the DSEC and MVSEC datasets, and show significant reduction in power consumption compared to the equivalent ANNs.
Event cameras generate asynchronous, sparse event streams that capture changes in light intensity. They offer significant advantages over conventional frame-based cameras, such as higher dynamic range and a much faster data rate, making them particularly useful for fast motion or challenging lighting conditions. Spiking neural networks (SNNs) share these asynchronous and sparse characteristics and are well suited to processing event-camera data. Inspired by the potential of transformers and spike-driven transformers (spikeformers) in other computer vision tasks, two solutions are proposed for fast and robust optical flow estimation with event cameras: STTFlowNet and SDformerFlow. STTFlowNet adopts a U-shaped artificial neural network (ANN) architecture with spatiotemporal shifted-window (swin) transformer encoders, while SDformerFlow is its fully spiking counterpart, incorporating swin spikeformer encoders; two variants of the spiking version with different neuron models are also presented. This is the first work to use spikeformers for dense optical flow estimation. All models are trained end to end with supervised learning. The results achieve state-of-the-art performance among SNN-based event optical flow methods on both the DSEC and MVSEC datasets, with significantly lower power consumption than the equivalent ANNs.
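Learned event-based optical flow pipelines (ANN or SNN) commonly start by accumulating the asynchronous event stream into a voxel grid of temporal bins; the sketch below shows this generic preprocessing step and is not necessarily the exact representation used by STTFlowNet or SDformerFlow.

```python
# Hedged sketch of a common preprocessing step for learned event-based optical
# flow: accumulate the asynchronous (x, y, t, polarity) event stream into a
# voxel grid of temporal bins. Generic illustration only.
import numpy as np

def events_to_voxel_grid(events, n_bins, height, width):
    """events: (N, 4) array of [x, y, t, polarity] with polarity in {-1, +1}."""
    grid = np.zeros((n_bins, height, width), dtype=np.float32)
    t = events[:, 2]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)      # normalise to [0, 1]
    bins = np.clip((t_norm * n_bins).astype(int), 0, n_bins - 1)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    np.add.at(grid, (bins, ys, xs), events[:, 3])               # signed accumulation
    return grid

# toy usage: 1000 random events on a 260 x 346 sensor, 5 temporal bins
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 346, 1000), rng.integers(0, 260, 1000),
               np.sort(rng.random(1000)), rng.choice([-1, 1], 1000)], axis=1)
voxels = events_to_voxel_grid(ev, n_bins=5, height=260, width=346)
```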
https://arxiv.org/abs/2409.04082
Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
Despite significant advances in monocular depth estimation for static images, estimating video depth in the open world remains challenging because open-world videos vary widely in content, motion, camera movement, and length. DepthCrafter is an innovative method for generating temporally consistent, long depth sequences with intricate details for open-world videos, without requiring supplementary information such as camera poses or optical flow. It achieves generalization to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model through a carefully designed three-stage training strategy with compiled paired video-depth datasets. The training approach enables the model to generate depth sequences of variable length, up to 110 frames, at one time, and to harvest both precise depth details and rich content diversity from realistic and synthetic datasets. An inference strategy is also proposed for processing extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets show that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Moreover, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
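Segment-wise estimation with seamless stitching for long videos can be sketched as overlapping-window inference with blended overlaps; the window length, overlap, and the estimate_depth callable below are assumptions, not DepthCrafter's actual settings.

```python
# Hedged sketch of segment-wise inference for long videos: run the depth model
# on overlapping windows and linearly blend predictions in the overlaps so the
# stitched sequence stays consistent. Window/overlap sizes are assumptions.
import numpy as np

def stitch_depth(frames, estimate_depth, window=110, overlap=20):
    """frames: (T, H, W, 3); estimate_depth: callable mapping a clip to (t, H, W)."""
    t_total = len(frames)
    depth = np.zeros(frames.shape[:3], dtype=np.float32)
    weight = np.zeros((t_total, 1, 1), dtype=np.float32)
    start = 0
    while start < t_total:
        end = min(start + window, t_total)
        pred = estimate_depth(frames[start:end])           # (end-start, H, W)
        ramp = np.ones(end - start, dtype=np.float32)
        if start > 0:                                       # fade in over the overlap
            ramp[:overlap] = np.linspace(0.0, 1.0, min(overlap, end - start))
        depth[start:end] += pred * ramp[:, None, None]
        weight[start:end] += ramp[:, None, None]
        if end == t_total:
            break
        start = end - overlap
    return depth / np.clip(weight, 1e-6, None)

# toy usage with a dummy "model" that predicts constant depth
frames = np.zeros((300, 8, 8, 3), dtype=np.float32)
depth = stitch_depth(frames, lambda clip: np.ones(clip.shape[:3], np.float32))
```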
https://arxiv.org/abs/2409.02095
Few-shot imitation learning relies on only a small number of task-specific demonstrations to efficiently adapt a policy to a given downstream task. Retrieval-based methods come with a promise of retrieving relevant past experiences to augment this target data when learning policies. However, existing data retrieval methods fall under two extremes: they either rely on the existence of exact behaviors with visually similar scenes in the prior data, which is impractical to assume; or they retrieve based on semantic similarity of high-level language descriptions of the task, which might not be that informative about the shared low-level behaviors or motions across tasks, which are often a more important factor for retrieving relevant data for policy learning. In this work, we investigate how we can leverage motion similarity in the vast amount of cross-task data to improve few-shot imitation learning of the target task. Our key insight is that motion-similar data carries rich information about the effects of actions and object interactions that can be leveraged during few-shot adaptation. We propose FlowRetrieval, an approach that leverages optical flow representations for both extracting similar motions to target tasks from prior data, and for guiding learning of a policy that can maximally benefit from such data. Our results show FlowRetrieval significantly outperforms prior methods across simulated and real-world domains, achieving on average 27% higher success rate than the best retrieval-based prior method. In the Pen-in-Cup task with a real Franka Emika robot, FlowRetrieval achieves 3.7x the performance of the baseline imitation learning technique that learns from all prior and target data. Website: this https URL
Few-shot imitation learning relies on only a small number of task-specific demonstrations to efficiently adapt a policy to a given downstream task. Retrieval-based methods promise to retrieve relevant past experiences to augment this target data when learning policies. However, existing data retrieval methods fall into two extremes: they either rely on the prior data containing exactly the same behaviors in visually similar scenes, which is impractical to assume, or they retrieve based on the semantic similarity of high-level language descriptions of the task, which may say little about the shared low-level behaviors or motions across tasks that often matter more for retrieving relevant data for policy learning. This work investigates how motion similarity in the large body of cross-task data can improve few-shot imitation learning of a target task. The key insight is that motion-similar data carry rich information about the effects of actions and object interactions that can be exploited during few-shot adaptation. FlowRetrieval uses optical flow representations both to extract motions similar to the target task from prior data and to guide learning of a policy that benefits maximally from such data. The results show that FlowRetrieval significantly outperforms prior methods in simulated and real-world domains, achieving on average a 27% higher success rate than the best retrieval-based prior method. In the Pen-in-Cup task with a real Franka Emika robot, FlowRetrieval achieves 3.7x the performance of the baseline imitation learning technique that learns from all prior and target data. A project website is provided.
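Retrieval by motion similarity can be sketched as nearest-neighbour search over optical-flow embeddings; the embedding source and array names below are placeholders rather than FlowRetrieval's components.

```python
# Hedged sketch of motion-based retrieval: embed optical flow for the target
# demonstrations and for the prior dataset, then keep the prior segments whose
# flow embeddings are most similar to the target. Names are placeholders.
import numpy as np

def retrieve_by_flow(target_emb, prior_emb, top_fraction=0.1):
    """target_emb: (Nt, D), prior_emb: (Np, D) L2-normalised flow embeddings.
    Returns indices of the prior segments to add to the training set."""
    sims = prior_emb @ target_emb.T                 # (Np, Nt) cosine similarities
    score = sims.max(axis=1)                        # best match to any target demo
    k = max(1, int(top_fraction * len(prior_emb)))
    return np.argsort(-score)[:k]

rng = np.random.default_rng(0)
t = rng.normal(size=(20, 64));   t /= np.linalg.norm(t, axis=1, keepdims=True)
p = rng.normal(size=(5000, 64)); p /= np.linalg.norm(p, axis=1, keepdims=True)
selected = retrieve_by_flow(t, p)                   # indices of retrieved segments
```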
https://arxiv.org/abs/2408.16944
Interpreting motion captured in image sequences is crucial for a wide range of computer vision applications. Typical estimation approaches include optical flow (OF), which approximates the apparent motion instantaneously in a scene, and multiple object tracking (MOT), which tracks the motion of subjects over time. Often, the motion of objects in a scene is governed by some underlying dynamical system which could be inferred by analyzing the motion of groups of objects. Standard motion analyses, however, are not designed to intuit flow dynamics from trajectory data, making such measurements difficult in practice. The goal of this work is to extend gradient-based dynamical systems analyses to real-world applications characterized by complex, feature-rich image sequences with imperfect tracers. The tracer trajectories are tracked using deep vision networks and gradients are approximated using Lagrangian gradient regression (LGR), a tool designed to estimate spatial gradients from sparse data. From gradients, dynamical features such as regions of coherent rotation and transport barriers are identified. The proposed approach is affordably implemented and enables advanced studies including the motion analysis of two distinct object classes in a single image sequence. Two examples of the method are presented on data sets for which standard gradient-based analyses do not apply.
Interpreting motion captured in image sequences is crucial for a wide range of computer vision applications. Typical estimation approaches include optical flow (OF), which approximates the instantaneous apparent motion in a scene, and multiple object tracking (MOT), which tracks the motion of subjects over time. Often the motion of objects in a scene is governed by an underlying dynamical system that could be inferred by analyzing the motion of groups of objects. Standard motion analyses, however, are not designed to intuit flow dynamics from trajectory data, which makes such measurements difficult in practice. The goal of this work is to extend gradient-based dynamical-systems analyses to real-world applications characterized by complex, feature-rich image sequences with imperfect tracers. Tracer trajectories are tracked with deep vision networks, and gradients are approximated with Lagrangian gradient regression (LGR), a tool designed to estimate spatial gradients from sparse data. From the gradients, dynamical features such as regions of coherent rotation and transport barriers are identified. The proposed approach is affordable to implement and enables advanced studies, including motion analysis of two distinct object classes in a single image sequence. Two examples of the method are presented on datasets to which standard gradient-based analyses do not apply.
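The core of estimating spatial gradients from sparse tracer data is a local least-squares fit of u(x) ~ u0 + G (x - x0) over neighbouring trajectories; the sketch below illustrates this idea in 2D and is a simplification of Lagrangian gradient regression.

```python
# Hedged sketch of gradient estimation from sparse tracers: fit the local
# linear model u(x) ~ u0 + G (x - x0) by least squares over neighbouring
# trajectories, where G approximates the 2D velocity-gradient tensor.
import numpy as np

def local_velocity_gradient(positions, velocities, centre):
    """positions, velocities: (N, 2) tracer states near `centre` (2,).
    Returns (G, u0): 2x2 velocity-gradient estimate and local velocity."""
    dx = positions - centre                         # (N, 2) offsets
    A = np.hstack([dx, np.ones((len(dx), 1))])      # columns: dx, dy, 1
    coeffs, *_ = np.linalg.lstsq(A, velocities, rcond=None)   # (3, 2)
    G = coeffs[:2].T                                # G[i, j] = du_i / dx_j
    u0 = coeffs[2]
    return G, u0

# toy check on a solid-body rotation u = (-omega*y, omega*x): G should be
# [[0, -omega], [omega, 0]], whose antisymmetric part gives the vorticity.
omega = 2.0
pts = np.random.default_rng(1).normal(size=(50, 2))
vel = np.stack([-omega * pts[:, 1], omega * pts[:, 0]], axis=1)
G, u0 = local_velocity_gradient(pts, vel, centre=np.zeros(2))
print(np.round(G, 3))   # approximately [[0., -2.], [2., 0.]]
```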
https://arxiv.org/abs/2408.16190
Autism spectrum disorder (ASD) is characterized by significant challenges in social interaction and comprehending communication signals. Recently, therapeutic interventions for ASD have increasingly utilized Deep learning powered-computer vision techniques to monitor individual progress over time. These models are trained on private, non-public datasets from the autism community, creating challenges in comparing results across different models due to privacy-preserving data-sharing issues. This work introduces MMASD+. MMASD+ consists of diverse data modalities, including 3D-Skeleton, 3D Body Mesh, and Optical Flow data. It integrates the capabilities of Yolov8 and Deep SORT algorithms to distinguish between the therapist and children, addressing a significant barrier in the original dataset. Additionally, a Multimodal Transformer framework is proposed to predict 11 action types and the presence of ASD. This framework achieves an accuracy of 95.03% for predicting action types and 96.42% for predicting ASD presence, demonstrating over a 10% improvement compared to models trained on single data modalities. These findings highlight the advantages of integrating multiple data modalities within the Multimodal Transformer framework.
Autism spectrum disorder (ASD) is characterized by significant challenges in social interaction and in comprehending communication signals. Recently, therapeutic interventions for ASD have increasingly used deep-learning-powered computer vision techniques to monitor individual progress over time. These models are trained on private, non-public datasets from the autism community, which makes it hard to compare results across different models because of privacy-preserving data-sharing issues. This work introduces MMASD+, which consists of diverse data modalities, including 3D skeleton, 3D body mesh, and optical flow data. It integrates the capabilities of the YOLOv8 and Deep SORT algorithms to distinguish between the therapist and children, addressing a significant barrier in the original dataset. In addition, a multimodal transformer framework is proposed to predict 11 action types and the presence of ASD. The framework achieves an accuracy of 95.03% for predicting action types and 96.42% for predicting ASD presence, an improvement of more than 10% over models trained on single data modalities. These findings highlight the advantages of integrating multiple data modalities within the multimodal transformer framework.
https://arxiv.org/abs/2408.15077
The paper presents a vision-based obstacle avoidance strategy for lightweight self-driving cars that can be run on a CPU-only device using a single RGB-D camera. The method consists of two steps: visual perception and path planning. The visual perception part uses ORBSLAM3 enhanced with optical flow to estimate the car's poses and extract rich texture information from the scene. In the path planning phase, we employ a method combining a control Lyapunov function and control barrier function in the form of quadratic program (CLF-CBF-QP) together with an obstacle shape reconstruction process (SRP) to plan safe and stable trajectories. To validate the performance and robustness of the proposed method, simulation experiments were conducted with a car in various complex indoor environments using the Gazebo simulation environment. Our method can effectively avoid obstacles in the scenes. The proposed algorithm outperforms benchmark algorithms in achieving more stable and shorter trajectories across multiple simulated scenes.
The paper presents a vision-based obstacle avoidance strategy for lightweight self-driving cars that can run on a CPU-only device with a single RGB-D camera. The method consists of two steps: visual perception and path planning. The visual perception part uses ORBSLAM3 enhanced with optical flow to estimate the car's poses and extract rich texture information from the scene. In the path planning phase, a control Lyapunov function and a control barrier function are combined in a quadratic program (CLF-CBF-QP), together with an obstacle shape reconstruction process (SRP), to plan safe and stable trajectories. To validate the performance and robustness of the proposed method, simulation experiments were conducted with a car in various complex indoor environments using the Gazebo simulation environment. The method effectively avoids obstacles in these scenes and outperforms benchmark algorithms, achieving more stable and shorter trajectories across multiple simulated scenes.
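The barrier-function side of a CLF-CBF-QP can be sketched for a single-integrator model and one circular obstacle: choose the velocity closest to the reference command subject to h_dot + alpha*h >= 0. Gains, geometry and the use of cvxpy below are assumptions for illustration only.

```python
# Hedged sketch of a control-barrier-function QP for a single-integrator robot
# and one circular obstacle: pick the velocity closest to a reference command
# subject to h_dot + alpha * h >= 0, with h(x) = ||x - x_obs||^2 - r^2. This
# illustrates only the CBF side of the CLF-CBF-QP; values are made up.
import numpy as np
import cvxpy as cp

def safe_velocity(x, u_ref, x_obs, radius, alpha=1.0):
    h = np.sum((x - x_obs) ** 2) - radius ** 2       # barrier: h >= 0 means safe
    grad_h = 2.0 * (x - x_obs)                       # dh/dx for dynamics x_dot = u
    u = cp.Variable(2)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_ref)),
                      [grad_h @ u + alpha * h >= 0])
    prob.solve()
    return u.value

x = np.array([0.0, 0.0])
u_cmd = np.array([1.0, 0.0])                          # heads straight at the obstacle
u_safe = safe_velocity(x, u_cmd, x_obs=np.array([1.5, 0.0]), radius=1.0)
print(u_safe)   # velocity slowed/deflected so the barrier constraint holds
```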
https://arxiv.org/abs/2408.11582