Paper Reading AI Learner

Evaluate What You Can't Evaluate: Unassessable Generated Responses Quality

2023-05-24 02:52:48
Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze

Abstract

LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs align better with human judgments than traditional reference-based evaluators, using them still poses many challenges. Reference-free evaluators are better suited to open-ended examples, which admit responses with different semantics. But not all examples are open-ended. For closed-ended examples with a unique semantically correct response, a reference-free evaluator may still rate a response highly even when it contradicts the facts and the semantics of the reference. To comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets, KdConv-ADV and DSTC7-ADV, based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging, since they require evaluators to reasonably assess closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient, so there are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
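The contrast between the two evaluator paradigms the abstract describes can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the judge prompt wording and the token-overlap scorer are assumptions chosen to show why a reference-free judge, which never sees the gold reference, can rate a fluent but factually wrong response highly on a closed-ended example.

```python
def reference_based_score(response: str, reference: str) -> float:
    """Toy reference-based metric: token-overlap F1 against a gold reference.
    Stands in for BLEU/ROUGE-style metrics; penalizes divergence from the reference."""
    resp, ref = set(response.lower().split()), set(reference.lower().split())
    if not resp or not ref:
        return 0.0
    overlap = len(resp & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(resp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def reference_free_prompt(context: str, response: str) -> str:
    """Builds a prompt for an LLM judge that sees NO gold reference.
    Without a reference (or external knowledge), the judge can mostly rate
    fluency and coherence, so a fluent but false answer may still score high."""
    return (
        "Rate the quality of the response to the dialogue context "
        "on a scale of 1-10. Reply with the number only.\n"
        f"Context: {context}\nResponse: {response}\n"
    )

# A closed-ended example: exactly one semantically correct answer exists.
# (Illustrative data, not drawn from KdConv-ADV or DSTC7-ADV.)
context = "Who wrote 'Dream of the Red Chamber'?"
reference = "Cao Xueqin wrote Dream of the Red Chamber."
wrong_but_fluent = "Lu Xun wrote Dream of the Red Chamber."

# The reference-based score is penalized by the factual substitution,
# while the reference-free judge is never shown the reference at all.
print(reference_based_score(wrong_but_fluent, reference))
print(reference_free_prompt(context, wrong_but_fluent))
```

The adversarial datasets essentially probe the second path: whether an LLM judge, given only context and response, can detect the factual error that the reference-based path catches trivially.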

Abstract (translated)

Large language models (LLMs) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs align better with human judgments than traditional reference-based evaluators, many challenges remain in using them. Reference-free evaluators are better suited to open-ended examples that admit responses with different semantics, but not all examples are open-ended. For closed-ended examples with a unique semantically correct response, reference-free evaluators may still judge a response to be of high quality even without it matching the reference. To comprehensively evaluate the reliability of LLM-based evaluators, we construct two adversarial meta-evaluation dialogue generation datasets, KdConv-ADV and DSTC7-ADV, based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are more challenging, since they require evaluators to reasonably assess closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient; risks remain in using reference-free evaluators based on LLMs to evaluate dialogue responses.

URL

https://arxiv.org/abs/2305.14658

PDF

https://arxiv.org/pdf/2305.14658.pdf

