Abstract
Since the breakthrough of ChatGPT, large language models (LLMs) have garnered significant attention in the research community. With the development of LLMs, the question of text style transfer for conversational models has emerged as a natural extension, where chatbots may possess their own styles or even characters. However, standard evaluation metrics have not yet been established for this new setting. This paper addresses the issue by proposing the LMStyle Benchmark, a novel evaluation framework for chat-style text style transfer (C-TST) that measures the quality of style transfer for LLMs in an automated and scalable manner. In addition to conventional style strength metrics, the LMStyle Benchmark introduces a novel metric called appropriateness, a high-level measure that accounts for coherence, fluency, and other implicit factors without the aid of reference samples. Our experiments demonstrate that the new evaluation methods introduced by the LMStyle Benchmark correlate more strongly with human judgments of appropriateness. Based on the LMStyle Benchmark, we present a comprehensive list of evaluation results for popular LLMs, including LLaMA, Alpaca, and Vicuna, reflecting their stylistic properties, such as formality and sentiment strength, along with their appropriateness.
URL
https://arxiv.org/abs/2403.08943