Paper Reading AI Learner

Olaf-World: Orienting Latent Actions for Video World Modeling

2026-02-10 18:58:41
Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

Abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent actions to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
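To make the alignment idea concrete, here is a minimal NumPy sketch of a sequence-level control-effect alignment loss in the spirit of Seq$\Delta$-REPA. The specific choices are assumptions, not the paper's formulation: the "integrated" latent action is taken as a simple sum over the clip, and alignment is measured by cosine similarity against the first-to-last frame feature difference from a frozen encoder.

```python
import numpy as np

def seq_delta_repa_loss(latent_actions: np.ndarray, frozen_feats: np.ndarray) -> float:
    """Hypothetical sketch of a sequence-level control-effect alignment loss.

    latent_actions: (B, T, D) per-step latent actions inferred from a clip.
    frozen_feats:   (B, T+1, D) per-frame features from a frozen,
                    self-supervised video encoder (assumed given).

    Aligns the integrated latent action over the clip with the temporal
    feature difference between the last and first frame, so that latent
    actions with the same semantic effect map to the same direction.
    """
    # Integrate latent actions over the sequence (assumption: simple sum).
    integrated = latent_actions.sum(axis=1)                # (B, D)
    # Observable "control effect": change in frozen encoder features.
    delta = frozen_feats[:, -1] - frozen_feats[:, 0]       # (B, D)
    # 1 - cosine similarity as the alignment loss, averaged over the batch.
    dot = (integrated * delta).sum(axis=-1)
    norm = (np.linalg.norm(integrated, axis=-1)
            * np.linalg.norm(delta, axis=-1) + 1e-8)
    return float((1.0 - dot / norm).mean())
```

Because the anchor (the frozen encoder's feature difference) is shared across clips and scenes, minimizing this loss gives latent actions a common coordinate system rather than a per-clip one.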


URL

https://arxiv.org/abs/2602.10104

PDF

https://arxiv.org/pdf/2602.10104.pdf

