HuixiangDou-CR: Coreference Resolution in Group Chats

2024-05-05 05:43:20
Huanjun Kong

Abstract

How can pronominal references in group chats be resolved? In this work, we preprocessed 58k entries of authentic chat data and manually annotated 2.3k questions. The reliability of this annotation was confirmed by the scaling law. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters. The best version improved the F1 score by 29.07, which confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) Supervised Fine-Tuning (SFT) training data in alpaca format, along with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data that leverages the scaling law principle. The scripts, raw data in alpaca format, and experiment tracking are open-sourced on GitHub, HuggingFace, and WandB. Users have authorized the use of the data involved.
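For readers unfamiliar with the alpaca format mentioned in the abstract, the following is a minimal sketch of what a single SFT record for this task might look like. The alpaca layout itself (the `instruction`, `input`, and `output` fields) is standard; the prompt wording, the speaker names, the sample chat, and the helper `make_alpaca_record` are all hypothetical illustrations and are not taken from the released dataset.

```python
import json


def make_alpaca_record(history, question, rewritten):
    """Build one alpaca-style record: instruction / input / output.

    `history` is a list of (speaker, text) pairs, `question` is the last
    message that may contain pronouns, and `rewritten` is the annotated
    version with pronouns replaced. All names here are illustrative.
    """
    # Hypothetical instruction text; the paper's actual prompt may differ.
    instruction = (
        "Rewrite the last message so that every pronoun is replaced by the "
        "entity it refers to in the chat history. Return the message "
        "unchanged if no rewrite is needed."
    )
    # Flatten the chat history into one text block, one message per line.
    context = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return {
        "instruction": instruction,
        "input": f"{context}\nLast message: {question}",
        "output": rewritten,
    }


# Made-up example chat, not real data.
record = make_alpaca_record(
    history=[
        ("user_a", "Has anyone tried the new tokenizer?"),
        ("user_b", "Yes, yesterday."),
    ],
    question="Does it support Chinese?",
    rewritten="Does the new tokenizer support Chinese?",
)

print(json.dumps(record, ensure_ascii=False, indent=2))
```

A list of such records, serialized as JSON, is the shape that common fine-tuning toolkits accept as alpaca-format input; whether the released data uses exactly these prompts should be checked against the repository.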

URL

https://arxiv.org/abs/2405.02817

PDF

https://arxiv.org/pdf/2405.02817.pdf

