LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots

2024-03-13 20:19:30
Jianlin Chen

Abstract

Since the breakthrough of ChatGPT, large language models (LLMs) have garnered significant attention in the research community. With the development of LLMs, the question of text style transfer for conversational models has emerged as a natural extension, where chatbots may possess their own styles or even characters. However, standard evaluation metrics have not yet been established for this new setting. This paper aims to address this issue by proposing the LMStyle Benchmark, a novel evaluation framework applicable to chat-style text style transfer (C-TST) that can measure the quality of style transfer for LLMs in an automated and scalable manner. In addition to conventional style strength metrics, LMStyle Benchmark further considers a novel class of metrics called appropriateness, a high-level metric that takes into account coherence, fluency, and other implicit factors without the aid of reference samples. Our experiments demonstrate that the new evaluation methods introduced by LMStyle Benchmark have a higher correlation with human judgments in terms of appropriateness. Based on LMStyle Benchmark, we present a comprehensive list of evaluation results for popular LLMs, including LLaMA, Alpaca, and Vicuna, reflecting their stylistic properties, such as formality and sentiment strength, along with their appropriateness.

URL

https://arxiv.org/abs/2403.08943

PDF

https://arxiv.org/pdf/2403.08943.pdf

