
FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

2025-05-15 08:27:16
Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, Qing Li

Abstract

This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations conditioned on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). In contrast to canonical approaches that handle dynamics prediction mostly implicitly and fold it together with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as an explicit motion representation. FlowDreamer first predicts 3D scene flow from the past frames and the action condition with a U-Net; a diffusion model then predicts the future frame conditioned on this scene flow. Despite its modular design, FlowDreamer is trained end-to-end. We conduct experiments on four different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer outperforms baseline RGB-D world models by 7% in semantic similarity, 11% in pixel quality, and 6% in success rate across various robot manipulation domains.
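The abstract only sketches the two-module design, so below is a minimal PyTorch sketch of how such a pipeline could be wired together: a U-Net-style network maps a past RGB-D frame and a robot action to a per-pixel 3D scene-flow field, and a flow-conditioned denoiser stands in for the diffusion model that predicts the future frame. All module names, tensor shapes, the 7-dimensional action, and the simplified single-step denoising objective are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: architecture details, shapes, and the one-step
# denoising loss are assumptions, not FlowDreamer's actual implementation.
import torch
import torch.nn as nn

class FlowUNet(nn.Module):
    """Toy stand-in for the U-Net that maps a past RGB-D frame plus a robot
    action to a per-pixel 3D scene-flow field (3 channels: dx, dy, dz)."""
    def __init__(self, in_ch=4, act_dim=7, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.act_proj = nn.Linear(act_dim, hidden)  # inject the robot action
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1),  # 3D flow
        )

    def forward(self, rgbd, action):
        h = self.enc(rgbd)                               # (B, hidden, H/4, W/4)
        h = h + self.act_proj(action)[:, :, None, None]  # broadcast action over space
        return self.dec(h)                               # (B, 3, H, W)

class FlowConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise in a noisy future frame, conditioned
    on the past frame and the predicted scene flow via channel concatenation."""
    def __init__(self, frame_ch=4, flow_ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(frame_ch * 2 + flow_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, frame_ch, 3, padding=1),
        )

    def forward(self, noisy_future, past, flow):
        return self.net(torch.cat([noisy_future, past, flow], dim=1))

# End-to-end training step (simplified DDPM-style objective, single timestep):
flow_net, denoiser = FlowUNet(), FlowConditionedDenoiser()
past = torch.randn(2, 4, 64, 64)     # past RGB-D frame (B, C, H, W)
future = torch.randn(2, 4, 64, 64)   # ground-truth future RGB-D frame
action = torch.randn(2, 7)           # robot action (e.g., end-effector delta)

flow = flow_net(past, action)                  # explicit motion representation
noise = torch.randn_like(future)
noisy_future = future + noise                  # stand-in for the DDPM forward process
pred_noise = denoiser(noisy_future, past, flow)
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()                                # gradients reach both modules
```

Because the predicted scene flow feeds directly into the denoiser's input, a single backward pass updates both modules, which is what end-to-end training of a modularized pipeline amounts to in this sketch.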


URL

https://arxiv.org/abs/2505.10075

PDF

https://arxiv.org/pdf/2505.10075.pdf

