Abstract
We present MetaFind, a scene-aware tri-modal compositional retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm tailored to 3D assets, since existing approaches rely mainly on general-purpose 3D shape representation models. Our key innovation is a flexible retrieval mechanism that accepts arbitrary combinations of text, image, and 3D modalities as queries, enhancing spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structure. Methodologically, MetaFind introduces a plug-and-play equivariant layout encoder, ESSGNN, that captures spatial relationships and object appearance features, ensuring that retrieved 3D assets remain contextually and stylistically coherent with the existing scene regardless of coordinate-frame transformations. The framework supports iterative scene construction by continuously adapting retrieval results to the current state of the scene. Empirical evaluations show that MetaFind achieves improved spatial and stylistic consistency over baseline methods across a range of retrieval tasks.
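To make the mechanisms described above more concrete, the sketch below illustrates, in plain NumPy, the two ideas the abstract highlights: composing a retrieval query from whatever subset of text, image, and 3D embeddings is available, and encoding a scene layout from pairwise distances between object centroids so that the result is unchanged under rotations and translations of the coordinate frame. This is only a minimal sketch under assumed interfaces; the paper's ESSGNN is a learned equivariant graph network, which this toy code does not reproduce, and all names here (compose_query, encode_layout, FEAT_DIM) are hypothetical rather than taken from the paper.

```python
# Illustrative sketch only -- not the authors' implementation.
import numpy as np

FEAT_DIM = 8  # hypothetical shared embedding dimension

def compose_query(text_emb=None, image_emb=None, shape_emb=None):
    """Fuse whichever modality embeddings are provided into one query vector."""
    parts = [e for e in (text_emb, image_emb, shape_emb) if e is not None]
    if not parts:
        raise ValueError("at least one modality embedding is required")
    q = np.mean(parts, axis=0)       # simple average; the paper may use a learned fusion
    return q / np.linalg.norm(q)     # normalize for cosine-similarity retrieval

def encode_layout(centroids, appearance_feats):
    """Encode a scene from object centroids (N, 3) and per-object features (N, D).

    Messages depend only on pairwise distances, so the output is invariant to any
    rigid transform of the scene's coordinate frame (a simplification of the
    equivariance property attributed to ESSGNN).
    """
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1, keepdims=True)   # (N, N, 1), frame-invariant
    weights = np.exp(-dists)                                 # nearer objects contribute more
    np.fill_diagonal(weights[..., 0], 0.0)                   # no self-messages
    messages = (weights * appearance_feats[None, :, :]).sum(axis=1)
    return appearance_feats + messages                       # one message-passing step

# Toy usage: score candidate assets against a text+image query in the context of a scene.
rng = np.random.default_rng(0)
scene_centroids = rng.normal(size=(4, 3))
scene_feats = rng.normal(size=(4, FEAT_DIM))
context = encode_layout(scene_centroids, scene_feats).mean(axis=0)

query = compose_query(text_emb=rng.normal(size=FEAT_DIM),
                      image_emb=rng.normal(size=FEAT_DIM))
candidates = rng.normal(size=(10, FEAT_DIM))
scores = candidates @ (query + 0.5 * context)   # hypothetical blend of query and scene context
print("best candidate:", int(np.argmax(scores)))
```

Because the layout encoding consumes only inter-object distances and appearance features, rotating or translating all centroids leaves the scores untouched, which is the intuition behind retrieving assets that stay coherent with the scene regardless of coordinate-frame transformations.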
URL
https://arxiv.org/abs/2510.04057