Paper Reading AI Learner

POINTS1.5: Building a Vision-Language Model towards Real World Applications

2024-12-11 15:08:25
Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou

Abstract

Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g., optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.
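To make point i) concrete, below is a minimal Python sketch (not the authors' code) of what a NaViT-style, native-resolution front end does differently from tile-based pipelines: the image is cut into patches at its original size and handed to the transformer as a single variable-length token sequence, rather than being resized to a fixed square or split into fixed-resolution tiles. The patch size, padding scheme, and function names here are illustrative assumptions.

```python
# Sketch of native dynamic-resolution patchification (NaViT-style idea).
# Assumptions: patch size 14, zero-padding to a multiple of the patch size.
import torch

PATCH = 14  # assumed ViT patch size

def patchify_native(image: torch.Tensor, patch: int = PATCH):
    """Turn a (C, H, W) image of arbitrary resolution into a sequence of
    flattened patches plus their (row, col) grid coordinates."""
    c, h, w = image.shape
    # Pad H and W up to multiples of the patch size so nothing is cropped.
    pad_h = (patch - h % patch) % patch
    pad_w = (patch - w % patch) % patch
    image = torch.nn.functional.pad(image, (0, pad_w, 0, pad_h))
    gh, gw = image.shape[1] // patch, image.shape[2] // patch
    # (C, H, W) -> (gh, gw, C, patch, patch) -> (gh*gw, C*patch*patch)
    patches = (image
               .unfold(1, patch, patch)
               .unfold(2, patch, patch)
               .permute(1, 2, 0, 3, 4)
               .reshape(gh * gw, c * patch * patch))
    # Grid coordinates let the encoder use 2-D positional information.
    coords = torch.stack(torch.meshgrid(
        torch.arange(gh), torch.arange(gw), indexing="ij"), dim=-1).reshape(-1, 2)
    return patches, coords  # one variable-length sequence, no tiling

# Example: a 1080x1920 screenshot becomes a single sequence of 78*138 patch
# tokens instead of several independently encoded fixed-resolution tiles.
tokens, coords = patchify_native(torch.rand(3, 1080, 1920))
print(tokens.shape, coords.shape)  # torch.Size([10764, 588]) torch.Size([10764, 2])
```

In a NaViT-style setup, sequences of different lengths produced this way are packed together into batches, which is what lets the encoder accept images at their native resolution instead of forcing every input through one fixed geometry.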


URL

https://arxiv.org/abs/2412.08443

PDF

https://arxiv.org/pdf/2412.08443.pdf

