Paper Reading AI Learner

MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation

2025-10-05 06:37:26
Zhenyu Pan, Yucheng Lu, Han Liu

Abstract

We present MetaFind, a scene-aware tri-modal compositional retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm tailored to 3D assets, since existing approaches mainly rely on general-purpose 3D shape representation models. Our key innovation is a flexible retrieval mechanism that accepts arbitrary combinations of text, image, and 3D modalities as queries, enhancing spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structure. Methodologically, MetaFind introduces a plug-and-play equivariant layout encoder, ESSGNN, that captures spatial relationships and object appearance features, ensuring that retrieved 3D assets are contextually and stylistically coherent with the existing scene regardless of coordinate frame transformations. The framework supports iterative scene construction by continuously adapting retrieval results as the scene is updated. Empirical evaluations demonstrate that MetaFind achieves better spatial and stylistic consistency than baseline methods across a variety of retrieval tasks.
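
To make the compositional retrieval idea concrete, below is a minimal sketch of scene-aware retrieval with queries composed from any subset of modalities. Everything in it is an illustrative assumption, not MetaFind's actual architecture: the shared 512-d embedding space, the `_placeholder_encoder` stub standing in for learned per-modality encoders, mean-pooling as the fusion strategy, and the `alpha`-weighted re-ranking against a scene-layout embedding.

```python
import hashlib
import numpy as np

D = 512  # shared embedding dimension (illustrative assumption)

def _placeholder_encoder(x: str) -> np.ndarray:
    """Stand-in for a learned per-modality encoder (e.g. a CLIP-style tower).
    A deterministic random projection keyed on the input; demonstration only."""
    seed = int.from_bytes(hashlib.sha256(x.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(D)
    return v / np.linalg.norm(v)

def compose_query(text=None, image=None, shape=None) -> np.ndarray:
    """Fuse any subset of {text, image, 3D} queries into one embedding.
    Mean-pooling in a shared space is one simple compositional strategy;
    MetaFind's actual fusion mechanism may differ."""
    parts = [_placeholder_encoder(m) for m in (text, image, shape) if m is not None]
    q = np.mean(parts, axis=0)
    return q / np.linalg.norm(q)

def retrieve(query_vec, asset_vecs, scene_vec, alpha=0.7, k=3):
    """Rank assets by a convex combination of query similarity and
    scene-layout compatibility; alpha is an assumed trade-off weight."""
    score = alpha * (asset_vecs @ query_vec) + (1 - alpha) * (asset_vecs @ scene_vec)
    return np.argsort(-score)[:k]

# Toy asset library; scene_vec stands in for a scene-level layout embedding
# such as the one ESSGNN would produce for the partially built scene.
assets = ["oak_desk", "velvet_sofa", "floor_lamp", "bookshelf", "wool_rug"]
asset_vecs = np.stack([_placeholder_encoder(a) for a in assets])
scene_vec = _placeholder_encoder("mid-century living room")

q = compose_query(text="a reading lamp", image="lamp_photo.png")
print([assets[i] for i in retrieve(q, asset_vecs, scene_vec)])
```

Mean-pooling is simply the cheapest fusion that keeps the query space shared across modalities; a learned fusion module could be swapped in without changing the retrieval interface, and re-running retrieval with an updated `scene_vec` after each placement gives the iterative construction loop the abstract describes.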
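
The abstract attributes robustness to coordinate-frame changes to the equivariant layout encoder ESSGNN. The toy below demonstrates the underlying property with the simplest possible invariant, sorted pairwise distances between object positions; it illustrates E(3)-invariance in general, not the paper's actual layer.

```python
import numpy as np

def invariant_layout_features(positions: np.ndarray) -> np.ndarray:
    """An E(3)-invariant summary of object layout: sorted pairwise distances.
    Relative-geometry features like these are what equivariant GNN layers
    (EGNN-style message passing) are built from."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(positions), k=1)  # each unordered pair once
    return np.sort(dists[iu])

rng = np.random.default_rng(1)
pos = rng.standard_normal((5, 3))  # 5 objects placed in a scene

# Apply an arbitrary rigid motion: random orthogonal transform + translation.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
pos_moved = pos @ Q.T + np.array([2.0, -1.0, 0.5])

assert np.allclose(invariant_layout_features(pos),
                   invariant_layout_features(pos_moved))
print("layout features are unchanged under a rigid transform")
```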

URL

https://arxiv.org/abs/2510.04057

PDF

https://arxiv.org/pdf/2510.04057.pdf

