
WildChat: 1M ChatGPT Interaction Logs in the Wild

2024-05-02 17:00:02
Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng


Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at this https URL under AI2 ImpACT Licenses.



