Abstract
How can pronominal references in group chats be resolved? In this work, we preprocessed 58k authentic chat messages and manually annotated 2.3k questions. The reliability of the annotation was confirmed via the scaling law. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters; the best version improved the F1 score by 29.07 points. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) Supervised Fine-Tuning (SFT) training data in Alpaca format, together with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data that leverages the scaling-law principle. The scripts, raw data in Alpaca format, and experiment tracking are open-sourced on GitHub (this https URL), HuggingFace (this https URL), and WandB (this https URL). Use of the chat data has been authorized by the users involved.
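The Alpaca format mentioned above is a simple instruction/input/output JSON schema commonly used for SFT data. A minimal sketch of what one record for this pronoun-resolution task might look like (the instruction wording and chat content here are hypothetical, not taken from the released dataset):

```python
import json

# Hypothetical Alpaca-format SFT record for pronominal reference resolution.
# The actual instructions and chat messages in the released dataset may differ.
record = {
    "instruction": "Resolve the pronouns in the final message of this group chat.",
    "input": "A: Did Bob push the fix?\nB: He merged it last night.",
    "output": "\"He\" refers to Bob; \"it\" refers to the fix.",
}

# An Alpaca-format dataset is a JSON array of such records.
dataset = [record]
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Each record supplies an instruction, optional context in `input`, and the target completion in `output`, which SFT frameworks concatenate into a prompt/response pair.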
URL
https://arxiv.org/abs/2405.02817