Abstract
Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, to our knowledge the largest vision-language corpus for agriculture, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This curriculum achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. It also remains competitive on the text-only AgriBench-13K, with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open-source all resources to support reproducible research and deployment in low-resource agricultural settings.
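As an illustration of the pairwise win rate mentioned above, the following is a minimal sketch of how such a rate could be computed from judge verdicts. The abstract does not describe the judging protocol, so the verdict labels (`"A"`, `"B"`, `"tie"`), the function name `pairwise_win_rate`, and the option to count a tie as half a win are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter

# Hypothetical verdict labels: "A" means the model under test won the
# comparison, "B" means the baseline won, "tie" means no preference.
VALID_VERDICTS = {"A", "B", "tie"}


def pairwise_win_rate(verdicts, count_ties_as_half=True):
    """Return the win rate of model A over a list of judge verdicts.

    Assumption: a tie contributes half a win when count_ties_as_half
    is True, and nothing otherwise. Ties always count in the total.
    """
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    if not verdicts:
        raise ValueError("no verdicts to aggregate")

    counts = Counter(verdicts)
    wins = counts["A"]
    if count_ties_as_half:
        wins += 0.5 * counts["tie"]
    return wins / len(verdicts)


if __name__ == "__main__":
    # Made-up verdicts for five comparisons, for demonstration only.
    example = ["A", "A", "B", "tie", "A"]
    print(pairwise_win_rate(example))
    print(pairwise_win_rate(example, count_ties_as_half=False))
```

Whether ties count as half a win, are dropped, or count as losses changes the reported figure, so a comparison between win rates is only meaningful when both use the same convention.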
URL
https://arxiv.org/abs/2510.04002