Paper Reading AI Learner

Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images

2025-06-09 13:23:44
Yingping Liang, Ying Fu, Yutao Hu, Wenqi Shao, Jiaming Liu, Debing Zhang

Abstract

Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for video tasks. However, its real-world robustness is limited by training on animated synthetic datasets, which introduces domain gaps in real-world applications and limits the benefit of scaling up data. To address these challenges, we propose **Flow-Anything**, a large-scale data generation framework designed to learn optical flow estimation from arbitrary single-view images in the real world. We employ two effective steps to make scaling up data practical. First, we convert a single-view image into a 3D representation using advanced monocular depth estimation networks, which allows us to render optical flow and novel-view images under a virtual camera. Second, we develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to model dynamic objects in the 3D representation. These two steps allow us to generate a realistic training dataset, the **FA-Flow Dataset**, from large-scale single-view images. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images, outperforming the most advanced unsupervised methods and methods supervised on synthetic datasets. Moreover, our models serve as foundation models that enhance the performance of various downstream video tasks.
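The first step described in the abstract (lift a single image to 3D with monocular depth, then render the flow induced by a virtual camera) can be illustrated with a minimal sketch. This is not the authors' code: the pinhole intrinsics `K`, the virtual motion `(R, t)`, and the constant placeholder depth are illustrative assumptions, and the paper's Object-Independent Volume Rendering and Depth-Aware Inpainting modules (which handle dynamic objects and disocclusions) are omitted.

```python
# Minimal sketch, not the paper's implementation: compute the optical flow that a
# virtual camera motion would induce, given a single image's (estimated) depth map.
import numpy as np

def flow_from_depth(depth, K, R, t):
    """Return (H, W, 2) forward flow induced by a virtual camera move (R, t)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Unproject pixels to 3D camera coordinates using the estimated depth.
    rays = pix @ np.linalg.inv(K).T                            # normalized rays, Z = 1
    pts = rays * depth[..., None]                              # 3D points (X, Y, Z)

    # Apply the virtual rigid motion and project back into the image plane.
    pts2 = pts @ R.T + t
    proj = pts2 @ K.T
    u2 = proj[..., 0] / proj[..., 2]
    v2 = proj[..., 1] / proj[..., 2]

    # Optical flow = displacement of each pixel between the two (virtual) views.
    return np.stack([u2 - u, v2 - v], axis=-1)

if __name__ == "__main__":
    H, W = 480, 640
    K = np.array([[500.0, 0.0, W / 2], [0.0, 500.0, H / 2], [0.0, 0.0, 1.0]])
    depth = np.full((H, W), 5.0)       # placeholder; in practice, a monocular depth network's output
    R = np.eye(3)                      # no rotation
    t = np.array([0.05, 0.0, 0.0])     # small sideways virtual translation
    flow = flow_from_depth(depth, K, R, t)
    print(flow.shape)                  # (480, 640, 2)
```

With real images, the same re-projection also renders the novel view, and the paper's additional modules are what make the warped result realistic (filling disocclusions and adding independently moving objects); the sketch only covers the static, rigid-scene geometry.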

Abstract (translated)

光流估计是计算机视觉中的一个重要子领域,为视频任务提供了基础。然而,由于使用动画合成数据集进行训练,其在现实世界中的鲁棒性受到了限制。这导致了应用到实际场景时出现领域差距,并且阻碍了通过扩大数据集来提升性能的效果。为了应对这些挑战,我们提出了**Flow-Anything**,这是一个大型的数据生成框架,旨在从真实世界的单视图图像中学习光流估计。我们采用了两个有效的步骤来使数据量的扩展变得可行。首先,我们利用先进的单目深度估计网络将一幅单视图图像转换为3D表示形式,这让我们能够通过虚拟相机渲染出光流和新视角下的图像。其次,我们开发了独立于物体的体积渲染模块以及基于深度感知的修复模块来模拟3D表示中的动态对象。这两个步骤使我们能够从大规模的单视图图像中生成用于训练的真实数据集,即**FA-Flow 数据集**。 首次展示了从大规模真实世界图像中生成光流训练数据的好处,在无监督方法和合成数据集上的有监督方法方面均超越了最先进的技术水平。此外,我们的模型作为基础模型增强了各种下游视频任务的性能。

URL

https://arxiv.org/abs/2506.07740

PDF

https://arxiv.org/pdf/2506.07740.pdf

