VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

2025-10-06 04:28:39
Nonghai Zhang, Zeyu Zhang, Jiazi Wang, Yang Zhao, Hao Tang

Abstract

Vision-Language Models (VLMs) have made significant progress on multimodal understanding, demonstrating strong capabilities on general tasks such as image captioning and visual reasoning. However, in specialized cultural-heritage domains such as 3D vase artifacts, existing models suffer from severe data scarcity and insufficient domain knowledge. Lacking targeted training data, current VLMs struggle to handle these culturally significant specialized tasks effectively. To address these challenges, we propose VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery analysis: it collects 664 3D models of ancient Greek vases with corresponding question-answer data and establishes a complete data construction pipeline. We further develop the VaseVLM model, whose performance on vase artifact analysis is enhanced through domain-adaptive training. Experimental results validate the effectiveness of our approach: on the VaseVQA-3D dataset we improve over the previous state of the art by 12.8% on R@1 and by 6.6% on lexical similarity, significantly improving the recognition and understanding of 3D vase artifacts and offering new technical pathways for digital heritage preservation research.
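
The headline numbers are reported on R@1 and lexical similarity. As a rough illustration of how such QA metrics can be computed, below is a minimal Python sketch; it assumes R@1 means the fraction of questions whose top-ranked candidate answer exactly matches the reference, and uses difflib's character-level ratio as one plausible lexical-similarity proxy. The paper's exact metric definitions may differ, and the example QA pairs are hypothetical.

```python
from difflib import SequenceMatcher

def recall_at_1(predictions, references):
    """Fraction of questions whose top-ranked answer matches the reference.

    `predictions` maps each question id to a ranked list of candidate
    answers; only the first (top-1) candidate is checked here.
    """
    hits = sum(
        1 for qid, ranked in predictions.items()
        if ranked and ranked[0].strip().lower() == references[qid].strip().lower()
    )
    return hits / len(references)

def lexical_similarity(predictions, references):
    """Mean character-level similarity (difflib ratio) between the top-1
    answer and the reference -- one simple lexical-similarity proxy."""
    scores = [
        SequenceMatcher(None, ranked[0].lower(), references[qid].lower()).ratio()
        for qid, ranked in predictions.items() if ranked
    ]
    return sum(scores) / len(scores)

# Toy usage with hypothetical VaseVQA-3D-style QA pairs:
refs = {"q1": "red-figure amphora", "q2": "attic black-figure"}
preds = {"q1": ["red-figure amphora", "kylix"], "q2": ["attic red-figure"]}
print(f"R@1: {recall_at_1(preds, refs):.2f}")
print(f"Lexical similarity: {lexical_similarity(preds, refs):.2f}")
```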

URL

https://arxiv.org/abs/2510.04479

PDF

https://arxiv.org/pdf/2510.04479.pdf

