Abstract
Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. Image-text models excel at interpreting pixel-level color and lighting cues, while 3D-text models capture volumetric structure and spatial relationships. However, unified approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face two challenges: the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework in which a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this guidance by simply passing textual hypotheses from the 3D-text model to the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with the spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at comparable navigation Success Rate (SR), CoNav often generates shorter paths than other methods, as measured by SPL (Success weighted by Path Length), showcasing both the potential and the challenges of fusing data from different modalities in embodied navigation. Project Page: this https URL
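To make the Success Rate versus SPL comparison concrete, the sketch below computes both metrics using the standard definition of SPL: the per-episode success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length, averaged over episodes. The function names and the episode values are illustrative assumptions, not code or results from CoNav.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Each episode contributes S_i * l_i / max(p_i, l_i), where l_i is the
    shortest-path length and p_i is the length the agent actually travelled.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if not successes:
        raise ValueError("need at least one episode")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            # Assumption of this sketch: start and goal never coincide.
            raise ValueError("shortest-path lengths must be positive")
        if success:
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Hypothetical episodes: both agents succeed on the same two of three
# episodes, but agent B takes paths twice as long as agent A.
successes = [True, True, False]
shortest = [10.0, 10.0, 10.0]
paths_a = [10.0, 12.5, 20.0]
paths_b = [20.0, 25.0, 20.0]

print("SR:", round(success_rate(successes), 2))
print("SPL A:", round(spl(successes, shortest, paths_a), 2))
print("SPL B:", round(spl(successes, shortest, paths_b), 2))
```

With these made-up numbers the two agents have identical SR, yet agent A's SPL is twice agent B's. This is the sense in which a higher SPL at close SR indicates shorter paths.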
URL
https://arxiv.org/abs/2505.16663