Abstract
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
URL
https://arxiv.org/abs/2410.02712