Abstract
Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods rely heavily on optical flow estimation and feature warping, incurring severe computational overhead and limiting scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF introduces a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism, enabling efficient information aggregation across frames. Extensive experiments on multiple benchmarks demonstrate that MambaVF achieves state-of-the-art performance on multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. Notably, MambaVF is highly efficient, reducing parameters by up to 92.25% and computational FLOPs by 88.79% while delivering a 2.1x speedup over existing methods. Project page: this https URL
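The core idea in the abstract, recasting temporal fusion as a sequential state update with a bidirectional scan, can be illustrated with a toy linear state-space recurrence. This is only a minimal sketch: the scalar coefficients `a`, `b`, `c` and the simple forward/backward averaging are illustrative assumptions, not MambaVF's actual (learned, input-dependent) parameterization.

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Linear SSM recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the sequence, so the cost is linear in its length,
    and no per-frame motion estimation or warping is required.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state carries information from all past frames
        ys.append(c * h)   # read out the fused feature at step t
    return ys

def bidirectional_scan(xs):
    """Average a forward and a backward scan, loosely mirroring the
    'bidirectional scanning' described in the abstract, so every output
    sees both past and future frames."""
    fwd = ssm_scan(xs)
    bwd = ssm_scan(xs[::-1])[::-1]
    return [(f + g) / 2 for f, g in zip(fwd, bwd)]

# Toy per-frame scalar features for an 8-frame sequence.
frames = [1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
print(ssm_scan(frames)[:4])        # impulse response decays geometrically
print(len(bidirectional_scan(frames)))  # 8: one fused value per frame
```

In an actual video model the scalars become matrices acting on per-frame feature vectors (or per-pixel features under a 2-D scan order), but the linear-time scan structure is the same.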
URL
https://arxiv.org/abs/2602.06017