Motion representation plays an important role in video understanding and has many applications, including action recognition and robot or autonomous vehicle guidance. Lately, transformer networks, through their self-attention mechanisms, have proved their efficiency in many applications. In this study, we introduce a new two-stream transformer video classifier, which extracts spatio-temporal information from content and from optical flow representing movement information. The proposed model identifies self-attention features across the joint optical flow and temporal frame domain and represents their relationships within the transformer encoder mechanism. The experimental results show that our proposed methodology provides excellent classification results on three well-known video datasets of human activities.
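The two-stream input preparation can be illustrated with a minimal NumPy sketch. This is our own toy, not the authors' pipeline: per-pixel frame differences stand in for a real dense optical-flow estimator, and the patch size and tensor shapes are illustrative assumptions. It shows how a content stream and a motion stream are tokenized into the flat patch tokens a transformer encoder would consume.

```python
import numpy as np

def frame_difference_motion(frames):
    """Crude stand-in for optical flow: per-pixel temporal differences.

    frames: (T, H, W) grayscale video. Returns (T-1, H, W) motion maps.
    A real pipeline would use a dense optical-flow estimator here."""
    return np.abs(np.diff(frames.astype(np.float32), axis=0))

def two_stream_tokens(frames, patch=8):
    """Tokenize the content stream (frames) and the motion stream
    (difference maps) into flat patch tokens. Returns a pair of
    (T-1, num_patches, patch*patch) arrays."""
    motion = frame_difference_motion(frames)

    def tokenize(vol):
        T, H, W = vol.shape
        vol = vol[:, :H - H % patch, :W - W % patch]
        t = vol.reshape(T, H // patch, patch, W // patch, patch)
        return t.transpose(0, 1, 3, 2, 4).reshape(T, -1, patch * patch)

    # Drop the first frame so both streams cover the same time steps.
    return tokenize(frames[1:].astype(np.float32)), tokenize(motion)

video = np.random.rand(5, 32, 32)
content, motion = two_stream_tokens(video)
assert content.shape == motion.shape == (4, 16, 64)
```

In the paper's model the two token streams would then be jointly attended over by the transformer encoder; here they are merely constructed.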
https://arxiv.org/abs/2601.14086
Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems, such as robot and drone navigation. To solve this challenge, most recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther from the glass surfaces. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM), which integrates extracted spatial features and estimated optical flow maps, and the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for training our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
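The motion-inconsistency cue above can be sketched independently of MVGD-Net: compare the mean optical-flow magnitude inside a candidate region against its surroundings. The helper below is a hypothetical illustration with a toy flow field, not a module of the paper.

```python
import numpy as np

def motion_inconsistency_score(flow, region_mask):
    """Ratio of mean flow magnitude outside a candidate region to the
    mean magnitude inside it. A score well above 1 means the region
    moves slower than its surroundings which, per the observation
    above, hints at a glass surface.

    flow: (H, W, 2) optical flow; region_mask: (H, W) bool."""
    mag = np.linalg.norm(flow, axis=-1)
    inside = mag[region_mask].mean()
    outside = mag[~region_mask].mean()
    return outside / (inside + 1e-6)

# Toy flow field: the left half ("glass") moves at half the speed.
flow = np.ones((8, 8, 2))
mask = np.zeros((8, 8), dtype=bool)
mask[:, :4] = True
flow[mask] *= 0.5
score = motion_inconsistency_score(flow, mask)
assert 1.5 < score < 2.5
```

A real detector would of course learn this cue jointly with appearance features rather than threshold a single ratio.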
https://arxiv.org/abs/2601.13715
Autonomous navigation for nano-scale unmanned aerial vehicles (nano-UAVs) is governed by extreme Size, Weight, and Power (SWaP) constraints (weighing less than 50 g, with a sub-100 mW onboard processor), distinguishing it fundamentally from standard robotic paradigms. This review synthesizes the state-of-the-art in sensing, computing, and control architectures designed specifically for these sub-100 mW computational envelopes. We critically analyse the transition from classical geometry-based methods to emerging "Edge AI" paradigms, including quantized deep neural networks deployed on ultra-low-power System-on-Chips (SoCs) and neuromorphic event-based control. Beyond algorithms, we evaluate the hardware-software co-design requisite for autonomy, covering advancements in dense optical flow, optimized Simultaneous Localization and Mapping (SLAM), and learning-based flight control. While significant progress has been observed in visual navigation and relative pose estimation, our analysis reveals persistent gaps in long-term endurance, robust obstacle avoidance in dynamic environments, and the "Sim-to-Real" transfer of reinforcement learning policies. This survey provides a roadmap for bridging these gaps, advocating for hybrid architectures that fuse lightweight classical control with data-driven perception to enable fully autonomous, agile nano-UAVs in GPS-denied environments.
https://arxiv.org/abs/2601.13252
Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
https://arxiv.org/abs/2601.12761
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
https://arxiv.org/abs/2601.10781
Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at this https URL.
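The video-to-event conversion at the heart of such a pipeline can be sketched in a few lines. The toy below is our own simplification (a single contrast threshold, no sensor noise model), not the specific pipeline used to build the dataset: a pixel fires an event whenever its log-intensity change since the last event exceeds a threshold.

```python
import numpy as np

def frames_to_events(frames, timestamps, thresh=0.2):
    """Minimal video-to-event conversion: emit an event whenever the
    log-intensity change at a pixel exceeds a contrast threshold.
    Returns events as (t, x, y, polarity) tuples."""
    log_f = np.log(frames.astype(np.float64) + 1e-3)
    ref = log_f[0].copy()           # per-pixel reference log intensity
    events = []
    for t, frame in zip(timestamps[1:], log_f[1:]):
        diff = frame - ref
        ys, xs = np.where(np.abs(diff) >= thresh)
        for x, y in zip(xs, ys):
            pol = 1 if diff[y, x] > 0 else -1
            events.append((t, x, y, pol))
            ref[y, x] += pol * thresh   # update reference like a sensor
    return events

# Toy clip: one pixel brightens sharply between two frames.
f0 = np.full((4, 4), 0.2)
f1 = f0.copy()
f1[1, 2] = 0.8
ev = frames_to_events(np.stack([f0, f1]), [0.0, 0.01])
assert ev == [(0.01, 2, 1, 1)]
```

Modern simulators additionally interpolate between frames for microsecond timing and model per-pixel threshold variation, which this sketch omits.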
https://arxiv.org/abs/2601.10054
Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. Building on a previous methodology that decomposes the AVSS task into two discrete subtasks, where a prompted segmentation mask is first provided to facilitate subsequent semantic analysis, our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the structure of the optical flow map. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.
https://arxiv.org/abs/2601.08133
Facial optical flow supports a wide range of tasks in facial motion analysis. However, the lack of high-resolution facial optical flow datasets has hindered progress in this area. In this paper, we introduce Splatting Rasterization Flow (SRFlow), a high-resolution facial optical flow dataset, and Splatting Rasterization Guided FlowNet (SRFlowNet), a facial optical flow model with tailored regularization losses. These losses constrain flow predictions using masks and gradients computed via difference or Sobel operators. This effectively suppresses high-frequency noise and large-scale errors in texture-less or repetitive-pattern regions, enabling SRFlowNet to be the first model explicitly capable of capturing high-resolution skin motion guided by Gaussian splatting rasterization. Experiments show that training with the SRFlow dataset improves facial optical flow estimation across various optical flow models, reducing end-point error (EPE) by up to 42% (from 0.5081 to 0.2953). Furthermore, when coupled with the SRFlow dataset, SRFlowNet achieves up to a 48% improvement in F1-score (from 0.4733 to 0.6947) on a composite of three micro-expression datasets. These results demonstrate the value of advancing both facial optical flow estimation and micro-expression recognition.
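A gradient-based regularization term of the kind described can be sketched concretely. The code below is a hedged toy, not the paper's loss: it applies a Sobel operator to one flow channel and penalizes squared gradients inside a mask (e.g. a texture-less region), which is exactly the mechanism that suppresses high-frequency noise there.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)

def sobel(img, kernel):
    """Tiny valid-region 2D correlation (no external dependencies)."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2), dtype=np.float32)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + H - 2, j:j + W - 2]
    return out

def gradient_regularizer(flow_u, mask):
    """Mean squared Sobel gradient of one flow channel, restricted to a
    masked region; minimizing it smooths the flow there.

    flow_u: (H, W) one flow channel; mask: (H, W) bool on the same grid."""
    gx = sobel(flow_u, SOBEL_X)
    gy = sobel(flow_u, SOBEL_X.T)
    m = mask[1:-1, 1:-1]            # crop mask to the valid region
    return float((gx[m] ** 2 + gy[m] ** 2).mean())

flat = np.zeros((10, 10), dtype=np.float32)                   # smooth flow
noisy = np.random.RandomState(0).randn(10, 10).astype(np.float32)
mask = np.ones((10, 10), dtype=bool)
assert gradient_regularizer(flat, mask) == 0.0
assert gradient_regularizer(noisy, mask) > 0.0
```

In training, such a term would be added to the data loss with a weight, and the mask would come from the splatting rasterization rather than being all-ones as here.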
https://arxiv.org/abs/2601.06479
We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as a time-dependent Poly-Fourier curve for parameter-efficient motion encoding. We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.
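The Poly-Fourier motion encoding can be made concrete: a per-Gaussian trajectory is a low-order polynomial in time plus a truncated Fourier series, so only a handful of coefficients describe a whole motion. The coefficient layout below is our own assumption for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def poly_fourier_trajectory(t, poly_coeffs, fourier_coeffs, period=1.0):
    """Evaluate a time-dependent Poly-Fourier curve.

    t: (T,) query times; poly_coeffs: (P, 3) coefficients of t^0..t^(P-1);
    fourier_coeffs: (K, 2, 3) sine/cosine coefficients per harmonic.
    Returns (T, 3) positions."""
    t = np.asarray(t, dtype=np.float64)
    pos = sum(c * t[:, None] ** p for p, c in enumerate(poly_coeffs))
    for k, (a, b) in enumerate(fourier_coeffs, start=1):
        w = 2 * np.pi * k * t / period
        pos = pos + a * np.sin(w)[:, None] + b * np.cos(w)[:, None]
    return pos

# A point moving linearly in x with a small periodic wobble added.
poly = np.array([[0.0, 0.0, 0.0],      # constant term
                 [1.0, 0.0, 0.0]])     # linear term: x(t) = t
four = np.array([[[0.1, 0.0, 0.0],     # + 0.1 * sin(2*pi*t) in x
                  [0.0, 0.0, 0.0]]])
xyz = poly_fourier_trajectory(np.array([0.0, 0.25]), poly, four)
assert np.allclose(xyz[0], [0.0, 0.0, 0.0])
assert np.allclose(xyz[1], [0.35, 0.0, 0.0])
```

The appeal of the representation is parameter efficiency: here two polynomial terms and one harmonic (nine numbers per axis pair) encode a smooth, periodic-plus-drift motion for the full time range.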
https://arxiv.org/abs/2601.05368
Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on utilizing handcrafted features, which are problem-dependent and optimal for specific tasks. Moreover, they are highly susceptible to dynamic events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need for expert knowledge. In this paper, we utilize automatic feature learning methods, which combine optical flow and three different deep models (i.e., a supervised convolutional neural network (S-CNN), a pretrained CNN feature extractor, and a hierarchical extreme learning machine (H-ELM)) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training accuracy, testing accuracy, and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with softmax and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM's training takes 445 seconds. Learning in S-CNN takes 770 seconds with a high-performance Graphical Processing Unit (GPU).
https://arxiv.org/abs/2601.00391
Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source community. We propose HaineiFRDM (Film Restoration Diffusion Model), a film restoration framework, to explore the diffusion model's powerful content-understanding ability and help human experts better restore indistinguishable film footage. Specifically, we employ a patch-wise training and testing strategy to make restoring high-resolution films on a single 24 GB-VRAM GPU possible, and design a position-aware Global Prompt and Frame Fusion module. Moreover, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we first restore a low-resolution result and use it as a global residual to mitigate blocky artifacts caused by patching. Finally, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic data. The experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.
https://arxiv.org/abs/2512.24946
Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty into these two axes. We employ efficient, training-free proxies to map data onto a 2D curriculum grid: optical flow and keyframe entropy for visual complexity, and Calibrated Surprisal for cognitive complexity. A competence-aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.
https://arxiv.org/abs/2601.00887
Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
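The first innovation, skipping low-motion frames, can be sketched with a simple decision rule. The code below is a hedged toy: it uses a single-window SSIM over the whole frame (a simplified stand-in for the windowed SSIM and optical-flow analysis the paper combines), and the threshold is illustrative.

```python
import numpy as np

def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole frame; inputs in [0, 1].
    Real pipelines compute SSIM over local windows and average."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

def skip_low_motion(frames, ssim_thresh=0.98):
    """Return indices of frames to edit; frames too similar to the last
    kept frame are marked low-motion, to be recovered later by
    interpolation (threshold is an illustrative assumption)."""
    keep = [0]
    for i in range(1, len(frames)):
        if global_ssim(frames[keep[-1]], frames[i]) < ssim_thresh:
            keep.append(i)
    return keep

rng = np.random.RandomState(0)
still = rng.rand(16, 16)
frames = [still, still.copy(), rng.rand(16, 16)]  # frame 1 is static
assert skip_low_motion(frames) == [0, 2]
```

In the full method the kept frames would then be partitioned into segments for pipelined DDIM inversion and joint editing, with the skipped frames interpolated afterwards.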
https://arxiv.org/abs/2512.24026
Sea Surface Temperature (SST) prediction plays a vital role in climate modeling and disaster forecasting. However, it remains challenging due to its nonlinear spatiotemporal dynamics and extended prediction horizons. To address this, we propose OptFormer, a novel encoder-decoder model that integrates phase-space reconstruction with a motion-aware attention mechanism guided by optical flow. Unlike conventional attention, our approach leverages inter-frame motion cues to highlight relative changes in the spatial field, allowing the model to focus on dynamic regions and capture long-range temporal dependencies more effectively. Experiments on NOAA SST datasets across multiple spatial scales demonstrate that OptFormer achieves superior performance under a 1:1 training-to-prediction setting, significantly outperforming existing baselines in accuracy and robustness.
https://arxiv.org/abs/2601.06078
This paper presents the design and implementation of a relative localization system for SnailBot, a modular self-reconfigurable robot. The system integrates ArUco marker recognition, optical flow analysis, and IMU data processing into a unified fusion framework, enabling robust and accurate relative positioning for collaborative robotic tasks. Experimental validation demonstrates the effectiveness of the system in real-time operation, with a rule-based fusion strategy ensuring reliability across dynamic scenarios. The results highlight the potential for scalable deployment in modular robotic systems.
https://arxiv.org/abs/2512.21226
Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is firstly utilized to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results of both simulated and real events demonstrate that our methods outperform event-based state-of-the-art methods in terms of both accuracy and robustness.
https://arxiv.org/abs/2512.21053
Human language processing relies on the brain's capacity for predictive inference. We present a machine learning framework for decoding neural (EEG) responses to dynamic visual language stimuli in Deaf signers. Using coherence between neural signals and optical flow-derived motion features, we construct spatiotemporal representations of predictive neural dynamics. Through entropy-based feature selection, we identify frequency-specific neural signatures that differentiate interpretable linguistic input from linguistically disrupted (time-reversed) stimuli. Our results reveal distributed left-hemispheric and frontal low-frequency coherence as key features in language comprehension, with experience-dependent neural signatures correlating with age. This work demonstrates a novel multimodal approach for probing experience-driven generative models of perception in the brain.
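The coherence computation between neural signals and optical-flow-derived motion features can be sketched as follows. This is a minimal NumPy stand-in for standard magnitude-squared coherence (as in scipy.signal.coherence); the sampling rate, signal model, and segment length are our own illustrative assumptions, not the study's parameters.

```python
import numpy as np

def msc(x, y, fs, nperseg=256):
    """Magnitude-squared coherence via Welch-style segment averaging."""
    nseg = len(x) // nperseg
    win = np.hanning(nperseg)
    X = np.fft.rfft(win * x[:nseg * nperseg].reshape(nseg, nperseg))
    Y = np.fft.rfft(win * y[:nseg * nperseg].reshape(nseg, nperseg))
    sxy = (X * np.conj(Y)).mean(axis=0)          # cross-spectrum
    sxx = (np.abs(X) ** 2).mean(axis=0)          # auto-spectra
    syy = (np.abs(Y) ** 2).mean(axis=0)
    f = np.fft.rfftfreq(nperseg, 1 / fs)
    return f, np.abs(sxy) ** 2 / (sxx * syy + 1e-12)

fs = 250.0                                       # EEG-like rate (assumed)
t = np.arange(0, 8, 1 / fs)
rng = np.random.RandomState(0)
motion = np.sin(2 * np.pi * 4 * t)               # 4 Hz flow magnitude
eeg = 0.6 * motion + 0.8 * rng.randn(t.size)     # neural signal tracking it

f, cxy = msc(eeg, motion, fs)
band = cxy[(f >= 3) & (f <= 5)].max()            # coherence near 4 Hz
baseline = cxy[(f >= 40) & (f <= 60)].mean()     # unrelated high band
assert band > 0.5 and band > baseline
```

In the study, such frequency-resolved coherence maps (per electrode, per band) are the spatiotemporal features that entropy-based selection then operates on.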
https://arxiv.org/abs/2512.20929
Learning from videos offers a promising path toward generalist robots by providing rich visual and temporal priors beyond what real robot datasets contain. While existing video generative models produce impressive visual predictions, they are difficult to translate into low-level actions. Conversely, latent-action models better align videos with actions, but they typically operate at the single-step level and lack high-level planning capabilities. We bridge this gap by introducing Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos. Our key idea is to learn a latent skill space through an intermediate representation based on optical flow that captures motion information aligned with both video dynamics and robot actions. By learning skills in this flow-based latent space, SOF enables high-level planning over video-derived skills and allows for easier translation of these skills into actions. Experiments show that our approach consistently improves performance in both multitask and long-horizon settings, demonstrating the ability to acquire and compose skills directly from raw visual data.
https://arxiv.org/abs/2512.20052
Recent advances in neural portrait animation have demonstrated remarkable potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs with spatially-adaptive fusing, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.
https://arxiv.org/abs/2512.17331
Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
https://arxiv.org/abs/2512.17323