We propose MotionAgent, which enables fine-grained motion control for text-guided image-to-video generation. The key component is a motion field agent that converts the motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter then uses this flow to condition the base image-to-video diffusion model, producing videos with fine-grained motion control. A significant improvement in the Video-Text Camera Motion metric on VBench indicates that our method achieves precise control over camera motion. We further construct a subset of VBench to evaluate the alignment between the motion described in the text and the motion in the generated video; our method outperforms other advanced models in motion generation accuracy.
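As a hedged illustration of the camera-motion half of such an analytical flow composition (a minimal sketch under assumed inputs, not the paper's actual module), the snippet below converts a relative camera pose and a per-pixel depth map into a dense optical flow field; the intrinsics `K`, the depth map, the rotation `R`, and the translation `t` are all assumed inputs, and composing object-trajectory flow on top is omitted.

```python
import numpy as np

def camera_flow(depth, K, R, t):
    """Project a relative camera motion (R, t) into a dense optical flow field.

    depth : (H, W) per-pixel depth of the source frame
    K     : (3, 3) camera intrinsics
    R, t  : rotation (3, 3) and translation (3,) from source to target camera
    Returns an (H, W, 2) flow field in pixels.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # (H, W, 3)

    # Back-project pixels to 3D points in the source camera frame.
    rays = pix @ np.linalg.inv(K).T          # (H, W, 3)
    pts = rays * depth[..., None]            # (H, W, 3)

    # Rigidly transform into the target camera frame and re-project.
    pts_tgt = pts @ R.T + t                  # (H, W, 3)
    proj = pts_tgt @ K.T
    uv = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)

    # Flow is the displacement between projected and original pixel coordinates.
    return uv - pix[..., :2]
```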
https://arxiv.org/abs/2502.03207
This paper presents a novel approach to Visual Inertial Odometry (VIO), focusing on the initialization and feature matching modules. Existing methods for initialization often suffer from either poor stability in visual Structure from Motion (SfM) or fragility in solving a huge number of parameters simultaneously. To address these challenges, we propose a new pipeline for visual inertial initialization that robustly handles various complex scenarios. By tightly coupling gyroscope measurements, we enhance the robustness and accuracy of visual SfM. Our method demonstrates stable performance even with only four image frames, yielding competitive results. In terms of feature matching, we introduce a hybrid method that combines optical flow and descriptor-based matching. By leveraging the robustness of continuous optical flow tracking and the accuracy of descriptor matching, our approach achieves efficient, accurate, and robust tracking results. Through evaluation on multiple benchmarks, our method demonstrates state-of-the-art performance in terms of accuracy and success rate. Additionally, a video demonstration on mobile devices showcases the practical applicability of our approach in the field of Augmented Reality/Virtual Reality (AR/VR).
https://arxiv.org/abs/2502.01297
Real-time ego-motion tracking for endoscopes is an important task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscopes. First, a multi-modal visual feature learning network is proposed to perform relative pose prediction, in which motion features from optical flow, scene features, and joint features from two adjacent observations are all extracted for prediction. Because the channel dimension of the concatenated image carries additional correlation information, a novel attention-based feature extractor is designed to integrate multi-dimensional information from the concatenation of two consecutive frames. To extract a more complete representation from the fused features, a novel pose decoder predicts the pose transformation from the concatenated feature map at the end of the framework. Finally, the absolute pose of the endoscope is computed from the relative poses. Experiments on three datasets covering various endoscopic scenes show that the proposed method outperforms state-of-the-art methods. In addition, the inference speed of the proposed method exceeds 30 frames per second, meeting the real-time requirement. The project page is here: this https URL
https://arxiv.org/abs/2501.18124
Particle Image Velocimetry (PIV) is an important tool for experimental fluid mechanics research. Several robust methodologies have been proposed to estimate the velocity field from the images; however, alternative methods are still needed to increase the spatial resolution of the results. This work presents a novel approach for estimating fluid flow fields using neural networks and the optical flow equation to predict displacement vectors between sequential images. The result is a continuous representation of the displacement that can be evaluated at the full spatial resolution of the image. The methodology was validated on synthetic and experimental images. Accurate results were obtained for the estimation of instantaneous velocity fields, as well as for the derived time-averaged turbulence quantities and power spectral densities. The proposed methodology differs from previous attempts to use machine learning for this task: it requires no prior training and can be applied directly to any pair of images.
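The following is a minimal sketch of this idea under stated assumptions (a small coordinate MLP, finite-difference image derivatives, and optimizer settings chosen purely for illustration, not the paper's exact configuration): a network is fitted to a single image pair by minimizing the optical flow equation residual, with no pre-training.

```python
import torch
import torch.nn as nn

def fit_displacement(img0, img1, iters=2000, lr=1e-3):
    """Fit a continuous displacement field to one image pair by minimizing the
    optical flow (brightness constancy) residual  Ix*u + Iy*v + It = 0.

    img0, img1 : (H, W) float tensors (consecutive PIV frames)
    Returns an MLP mapping normalized (x, y) coordinates to (u, v) displacements.
    """
    H, W = img0.shape
    # Spatial and temporal image derivatives by finite differences.
    Ix = torch.gradient(img0, dim=1)[0]
    Iy = torch.gradient(img0, dim=0)[0]
    It = img1 - img0

    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

    net = nn.Sequential(nn.Linear(2, 128), nn.Tanh(),
                        nn.Linear(128, 128), nn.Tanh(),
                        nn.Linear(128, 2))
    opt = torch.optim.Adam(net.parameters(), lr=lr)

    for _ in range(iters):
        uv = net(coords)  # (H*W, 2); uv in pixel units, coords only normalized as inputs
        residual = (Ix.reshape(-1) * uv[:, 0]
                    + Iy.reshape(-1) * uv[:, 1]
                    + It.reshape(-1))
        loss = residual.pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # The trained network can be queried at any (x, y), i.e. at full image resolution.
    return net
```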
https://arxiv.org/abs/2501.18641
We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom dataset that reflects recent deepfake generation techniques, we adapt IBA to create relevance and optical flow maps, visually highlighting manipulated regions and motion inconsistencies. Our results show that VIBA generates temporally and spatially consistent explanations, which align closely with human annotations, thus providing interpretability for video classification and particularly for deepfake detection.
https://arxiv.org/abs/2501.16889
The rapid development of Deepfake technology has enabled the generation of highly realistic manipulated videos, posing severe social and ethical challenges. Existing Deepfake detection methods primarily focus on either spatial or temporal inconsistencies, often neglecting the interplay between the two or suffering from interference caused by natural facial motions. To address these challenges, we propose the global context consistency flow (GC-ConsFlow), a novel dual-stream framework that effectively integrates spatial and temporal features for robust Deepfake detection. The global grouped context aggregation module (GGCA), integrated into the global context-aware frame flow stream (GCAF), enhances spatial feature extraction by aggregating grouped global context information, enabling the detection of subtle spatial artifacts within frames. The flow-gradient temporal consistency stream (FGTC) uses optical flow residuals and gradient-based features, rather than directly modeling the residuals, to make temporal feature extraction robust to the inconsistencies introduced by unnatural facial motion. By combining these two streams, GC-ConsFlow captures complementary spatiotemporal forgery traces effectively and robustly. Extensive experiments show that GC-ConsFlow outperforms existing state-of-the-art methods in detecting Deepfake videos under various compression scenarios.
https://arxiv.org/abs/2501.13435
Dynamic urban environments, characterized by moving cameras and objects, pose significant challenges for camera trajectory estimation by complicating the distinction between camera-induced and object motion. We introduce MONA, a novel framework designed for robust moving object detection and segmentation from videos shot by dynamic cameras. MONA comprises two key modules: Dynamic Points Extraction, which leverages optical flow and Tracking Any Point to identify dynamic points, and Moving Object Segmentation, which employs adaptive bounding box filtering and Segment Anything for precise moving object segmentation. We validate MONA by integrating it with the camera trajectory estimation method LEAP-VO, achieving state-of-the-art results on the MPI Sintel dataset compared to existing methods. These results demonstrate MONA's effectiveness for moving object detection and its potential in many other applications in the urban planning field.
https://arxiv.org/abs/2501.13183
Our research aims to develop machines that learn to perceive visual motion as do humans. While recent advances in computer vision (CV) have enabled DNN-based models to accurately estimate optical flow in naturalistic images, a significant disparity remains between CV models and the biological visual system in both architecture and behavior. This disparity includes humans' ability to perceive the motion of higher-order image features (second-order motion), which many CV models fail to capture because of their reliance on the intensity conservation law. Our model architecture mimics the cortical V1-MT motion processing pathway, utilizing a trainable motion energy sensor bank and a recurrent graph network. Supervised learning employing diverse naturalistic videos allows the model to replicate psychophysical and physiological findings about first-order (luminance-based) motion perception. For second-order motion, inspired by neuroscientific findings, the model includes an additional sensing pathway with nonlinear preprocessing before motion energy sensing, implemented using a simple multilayer 3D CNN block. When exploring how the brain acquired the ability to perceive second-order motion in natural environments, in which pure second-order signals are rare, we hypothesized that second-order mechanisms were critical when estimating robust object motion amidst optical fluctuations, such as highlights on glossy surfaces. We trained our dual-pathway model on novel motion datasets with varying material properties of moving objects. We found that training to estimate object motion from non-Lambertian materials naturally endowed the model with the capacity to perceive second-order motion, as can humans. The resulting model effectively aligns with biological systems while generalizing to both first- and second-order motion phenomena in natural scenes.
https://arxiv.org/abs/2501.12810
We consider the problem of segmenting objects in videos based on their motion and no other forms of supervision. Prior work has often approached this problem by using the principle of common fate, namely the fact that the motion of points that belong to the same object is strongly correlated. However, most authors have only considered instantaneous motion from optical flow. In this work, we present a way to train a segmentation network using long-term point trajectories as a supervisory signal to complement optical flow. The key difficulty is that long-term motion, unlike instantaneous motion, is difficult to model -- any parametric approximation is unlikely to capture complex motion patterns over long periods of time. We instead draw inspiration from subspace clustering approaches, proposing a loss function that seeks to group the trajectories into low-rank matrices where the motion of object points can be approximately explained as a linear combination of other point tracks. Our method outperforms the prior art on motion-based segmentation, which shows the utility of long-term motion and the effectiveness of our formulation.
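A hedged sketch of a low-rank grouping loss in this spirit follows (an illustrative formulation, not the paper's exact loss): trajectories weighted by a predicted soft object mask should form a nearly low-rank matrix, so the singular values beyond an assumed motion rank are penalized.

```python
import torch

def low_rank_trajectory_loss(trajs, mask_probs, rank=3):
    """Penalize the tail singular values of mask-weighted trajectory matrices.

    trajs      : (N, 2F) point trajectories (x/y over F frames, already centred)
    mask_probs : (K, N) soft assignment of each trajectory to K object masks
    rank       : assumed rank of each object's motion (a simple low-dimensional model)
    """
    loss = 0.0
    for k in range(mask_probs.shape[0]):
        weighted = mask_probs[k][:, None] * trajs      # (N, 2F) trajectories of group k
        s = torch.linalg.svdvals(weighted)             # singular values, descending
        loss = loss + s[rank:].sum()                   # energy not explained by a rank-`rank` model
    return loss / mask_probs.shape[0]
```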
https://arxiv.org/abs/2501.12392
Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
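One plausible form of such a temporal-gradient constraint is sketched below (an assumption about the loss shape, not the released implementation): the frame-to-frame change of the predicted depth is encouraged to match that of the reference depth.

```python
import torch

def temporal_gradient_loss(pred_depth, ref_depth):
    """Constrain the temporal depth gradient of the prediction to follow the reference.

    pred_depth, ref_depth : (T, H, W) depth maps for a clip of T frames
    """
    d_pred = pred_depth[1:] - pred_depth[:-1]   # temporal gradient of the prediction
    d_ref = ref_depth[1:] - ref_depth[:-1]      # temporal gradient of the reference
    return (d_pred - d_ref).abs().mean()
```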
https://arxiv.org/abs/2501.12375
Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames either in the image space or the feature space. However, they produce severe artifacts in the mask center when the masked area is too large and no pixel correspondences can be found for the center. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporally coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, for conditioning a diffusion model during the reverse diffusion process to produce temporally coherent inpainting results without requiring any training data or fine-tuning of the pre-trained diffusion models. VipDiff takes optical flow as guidance to extract valid pixels from reference frames that serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows for generating diverse video inpainting results over different sampled noise. Experiments demonstrate that VipDiff largely outperforms state-of-the-art video inpainting methods in terms of both spatial-temporal coherence and fidelity.
https://arxiv.org/abs/2501.12267
Egomotion estimation is crucial for applications such as autonomous navigation and robotics, where accurate and real-time motion tracking is required. However, traditional methods relying on inertial sensors are highly sensitive to external conditions and suffer from drift, leading to large inaccuracies over long distances. Vision-based methods, particularly those utilising event-based vision sensors, provide an efficient alternative by capturing data only when changes are perceived in the scene. This approach minimises power consumption while delivering high-speed, low-latency feedback. In this work, we propose a fully event-based pipeline for egomotion estimation that processes the event stream directly within the event-based domain. This method eliminates the need for frame-based intermediaries, allowing for low-latency and energy-efficient motion estimation. We construct a shallow spiking neural network using a synaptic gating mechanism to convert precise event timing into bursts of spikes. These spikes encode local optical flow velocities, and the network provides an event-based readout of egomotion. We evaluate the network's performance on a dedicated chip, demonstrating strong potential for low-latency, low-power motion estimation. Additionally, simulations of larger networks show that the system achieves state-of-the-art accuracy in egomotion estimation tasks with event-based cameras, making it a promising solution for real-time, power-constrained robotics applications.
https://arxiv.org/abs/2501.11554
Enhancing low-resolution, low-frame-rate videos to high-resolution, high-frame-rate quality is essential for a seamless user experience, motivating advancements in Continuous Spatial-Temporal Video Super Resolution (C-STVSR). While prior methods employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and a pre-trained optical flow network for motion representation. Interestingly, and contrary to common observations, we find that adding position encoding does not improve performance and can even degrade it. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent the spatial and temporal characteristics of video: 1) a B-spline Mapper for smooth temporal interpolation, and 2) a Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency.
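For intuition, a minimal random-Fourier-feature mapping of the kind a "Fourier Mapper" suggests is sketched below; the frequency matrix `B`, its scale, and the feature width are assumptions chosen for illustration rather than the paper's module.

```python
import math
import torch
import torch.nn as nn

class FourierMapper(nn.Module):
    """Map low-dimensional coordinates to sinusoidal features so a downstream MLP
    can represent high spatial frequencies (random Fourier features)."""

    def __init__(self, in_dim=2, num_freqs=128, scale=10.0):
        super().__init__()
        # Fixed random frequency matrix; `scale` controls the highest frequencies.
        self.register_buffer("B", torch.randn(in_dim, num_freqs) * scale)

    def forward(self, coords):                     # coords: (..., in_dim)
        proj = 2 * math.pi * coords @ self.B       # (..., num_freqs)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
```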
https://arxiv.org/abs/2501.11043
Purpose. This paper explores the capability of smartphones as computing devices for a quadcopter, specifically the ability of the drone to maintain its position, known as the position hold function. Image processing can be performed with the phone's sensors and powerful built-in camera. Method. Using Shi-Tomasi corner detection and the Lucas-Kanade sparse optical flow algorithm, ground features are detected and tracked with the downward-facing camera. The position is maintained by computing the quadcopter's displacement from the center of the image using the Euclidean distance, and the corresponding pitch and roll estimates are calculated using a PID controller. Results. Actual flights show a double standard deviation of 18.66 cm from the center for outdoor tests. With a quadcopter measuring 58 cm x 58 cm, this implies that 95% of the time the quadcopter stays within a diameter of 96 cm. For indoor tests, a double standard deviation of 10.55 cm means that 95% of the time the quadcopter stays within a diameter of 79 cm. Conclusion. Smartphone sensors and cameras can perform the optical-flow position hold function, demonstrating their potential as computing devices for drones. Recommendations. To further improve the positioning of the phone-based quadcopter system, sensor fusion with the phone's GNSS sensor should be explored, as it provides absolute positioning information for outdoor applications. Research Implications. As more devices and gadgets are integrated into the smartphone, this paper presents an opportunity for phone manufacturers and researchers to explore the potential of smartphones for drone use cases.
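A minimal sketch of this tracking-plus-control loop using OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade tracker; the PID gains, the loop timing `dt`, and the mapping from pixel offsets to roll/pitch commands are illustrative assumptions, and the flight-controller interface is omitted.

```python
import cv2
import numpy as np

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err, dt):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def position_hold_step(prev_gray, gray, prev_pts, pid_x, pid_y, dt=0.033):
    """Track ground features with sparse optical flow and compute roll/pitch corrections."""
    pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    good = pts[status.ravel() == 1].reshape(-1, 2)
    h, w = gray.shape
    center = np.array([w / 2.0, h / 2.0])
    # Euclidean offset of the tracked feature cloud from the image centre.
    offset = good.mean(axis=0) - center
    roll_cmd = pid_x.step(offset[0], dt)    # lateral correction
    pitch_cmd = pid_y.step(offset[1], dt)   # longitudinal correction
    return roll_cmd, pitch_cmd, good.reshape(-1, 1, 2)

# Feature initialisation with Shi-Tomasi corners on the first downward-facing frame:
# prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)
```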
https://arxiv.org/abs/2501.10752
Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth map. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities, containing word-, phrase-, and sentence-wise levels. Comprehensive experiments on QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
https://arxiv.org/abs/2501.10692
Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.
https://arxiv.org/abs/2501.10018
Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, as well as user studies, demonstrate that VanGogh achieves superior temporal consistency and color quality. Project page: this https URL.
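A hedged sketch of a flow-based temporal consistency loss of the kind mentioned above (an assumed formulation, not the released training code): the colorized frame at time t is warped to frame t+1 with backward optical flow and the difference is penalized.

```python
import torch
import torch.nn.functional as F

def temporal_warp_loss(color_t, color_t1, flow_t1_to_t):
    """Warp colorized frame t into frame t+1 using backward optical flow, then
    penalize the difference (a standard flow-warping consistency loss).

    color_t, color_t1 : (B, C, H, W) colorized frames
    flow_t1_to_t      : (B, 2, H, W) flow from frame t+1 back to frame t, in pixels
    """
    H, W = color_t.shape[-2:]
    ys, xs = torch.meshgrid(torch.arange(H, device=color_t.device),
                            torch.arange(W, device=color_t.device), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float()          # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow_t1_to_t            # where each t+1 pixel lives in frame t
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                 # (B, H, W, 2)
    warped = F.grid_sample(color_t, grid, align_corners=True)
    return (warped - color_t1).abs().mean()
```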
https://arxiv.org/abs/2501.09499
Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models. Video results are available on our webpage: this https URL. Source code and model checkpoints are available on GitHub: this https URL.
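A heavily simplified, hedged sketch of flow-warped noise follows (nearest-neighbour gather plus a variance-preserving blend); the authors' real-time warping algorithm is more sophisticated, and the `mix` parameter here is an illustrative assumption.

```python
import torch

def warp_noise(prev_noise, flow, mix=0.5):
    """Carry Gaussian noise along an optical flow field so it is temporally
    correlated, then blend in fresh noise so each frame stays roughly N(0, 1).

    prev_noise : (C, H, W) noise of the previous frame
    flow       : (2, H, W) backward flow (current frame -> previous frame), in pixels
    mix        : fraction of variance kept from the warped noise
    """
    C, H, W = prev_noise.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    src_x = (xs + flow[0]).round().long().clamp(0, W - 1)
    src_y = (ys + flow[1]).round().long().clamp(0, H - 1)
    warped = prev_noise[:, src_y, src_x]                  # nearest-neighbour gather
    fresh = torch.randn_like(prev_noise)
    # Variance-preserving blend keeps the marginal distribution close to N(0, 1).
    return mix ** 0.5 * warped + (1 - mix) ** 0.5 * fresh
```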
https://arxiv.org/abs/2501.08331
Weakly supervised violence detection refers to the technique of training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach, leveraging the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, achieving an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at this https URL.
https://arxiv.org/abs/2501.07496
In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves the state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.
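As a hedged sketch of what probabilistic integration of several predictions can look like (independent isotropic Gaussians and inverse-variance weighting are assumptions for illustration, not ProTracker's actual refinement):

```python
import numpy as np

def fuse_predictions(positions, variances):
    """Fuse several 2D position estimates of the same tracked point by maximizing
    the joint Gaussian likelihood (equivalent to inverse-variance weighting).

    positions : (M, 2) candidate positions, e.g. from flow chaining and feature matching
    variances : (M,)   isotropic variance (uncertainty) of each candidate
    Returns the fused position and its variance.
    """
    w = 1.0 / np.asarray(variances)                                  # precision of each candidate
    fused = (w[:, None] * np.asarray(positions)).sum(axis=0) / w.sum()
    fused_var = 1.0 / w.sum()
    return fused, fused_var

# Example: a flow-chained estimate and a feature-matching estimate of the same point;
# the more certain first candidate dominates the fused result.
# fuse_predictions([[120.4, 64.2], [121.0, 63.8]], [1.0, 4.0])
```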
https://arxiv.org/abs/2501.03220