We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking only 8 hours on a single 8xA800 (80G) GPU machine. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding, while also preserving the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4, which indicates LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computation resources. Therefore, the team will publicly release all resources (including data, model, data generation pipeline, and training code) to facilitate future research from the community: \url{this https URL}.
https://arxiv.org/abs/2404.19553
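To make the recipe above concrete, here is a minimal, hypothetical sketch of a QLoRA setup for context extension using the Hugging Face transformers/peft/bitsandbytes stack. The RoPE base, target sequence length, and LoRA rank below are illustrative placeholders, not the authors' reported hyperparameters.

```python
# Hedged sketch: 4-bit QLoRA fine-tuning of Llama-3-8B-Instruct with an
# enlarged RoPE base to target a longer context window. Values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: frozen 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    rope_theta=200_000_000.0,               # assumed enlarged RoPE base for long context
    max_position_embeddings=81920,          # ~80K target context length
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

lora_config = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trained
model.print_trainable_parameters()
```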
Large Language Models (LLMs) have catalyzed significant advancements in Natural Language Processing (NLP), yet they encounter challenges such as hallucination and the need for domain-specific knowledge. To mitigate these, recent methodologies have integrated information retrieved from external resources with LLMs, substantially enhancing their performance across NLP tasks. This survey paper addresses the absence of a comprehensive overview of Retrieval-Augmented Language Models (RALMs), covering both Retrieval-Augmented Generation (RAG) and Retrieval-Augmented Understanding (RAU), and provides an in-depth examination of their paradigm, evolution, taxonomy, and applications. The paper discusses the essential components of RALMs, including Retrievers, Language Models, and Augmentations, and how their interactions lead to diverse model structures and applications. RALMs demonstrate utility in a spectrum of tasks, from translation and dialogue systems to knowledge-intensive applications. The survey includes several evaluation methods for RALMs, emphasizing the importance of robustness, accuracy, and relevance in their assessment. It also acknowledges the limitations of RALMs, particularly in retrieval quality and computational efficiency, offering directions for future research. In conclusion, this survey aims to offer a structured insight into RALMs, their potential, and the avenues for their future development in NLP. The paper is supplemented with a GitHub repository containing the surveyed works and resources for further study: this https URL.
https://arxiv.org/abs/2404.19543
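As a concrete illustration of the retriever-plus-language-model loop that RALMs share, here is a minimal, self-contained sketch. The bag-of-words cosine retriever and the `generate` placeholder are hypothetical simplifications, not any specific system surveyed in the paper.

```python
# Hedged sketch of retrieval-augmented generation: retrieve top-k passages,
# then condition a (placeholder) language model on them.
from collections import Counter
import math

CORPUS = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Python is a programming language created by Guido van Rossum.",
    "Retrieval-augmented models combine a retriever with a generator.",
]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    q = Counter(query.lower().split())
    ranked = sorted(CORPUS, key=lambda d: cosine(q, Counter(d.lower().split())), reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"  # placeholder

query = "When was the Eiffel Tower completed?"
context = "\n".join(retrieve(query))
print(generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"))
```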
Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature, sourced from dialogues in the Chinese sitcom $\textit{My Own Swordsman}$. It includes 200 carefully handcrafted questions, all annotated with the Gricean maxims that have been violated. We test eight closed-source and open-source LLMs under two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on the multiple-choice questions. CausalLM follows GPT-4 with an accuracy of 78.5%. Other models, including GPT-3.5 and several open-source models, demonstrate lower accuracy, ranging from 20% to 60%, on the multiple-choice questions. Human raters were asked to rate the explanations of the implicatures generated by the LLMs on their reasonability, logic, and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability except for GPT-4, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation. Moreover, we find that LLMs' performance does not vary significantly across Gricean maxims, suggesting that LLMs do not process implicatures derived from different maxims differently. Our data and code are available at this https URL.
https://arxiv.org/abs/2404.19509
Two major areas of interest in the era of Large Language Models concern what LLMs know, and whether and how they may be able to reason (or rather, approximately reason). Since, to date, these lines of work have progressed largely in parallel (with notable exceptions), we are interested in investigating their intersection: probing for reasoning about implicitly-held knowledge. Suspecting the performance to be lacking in this area, we use a very simple set-up of comparisons between cardinalities associated with elements of various subjects (e.g., the number of legs a bird has versus the number of wheels on a tricycle). We empirically demonstrate that although LLMs make steady progress in knowledge acquisition and (pseudo)reasoning with each new GPT release, their capabilities are limited to statistical inference only. It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks, especially once arithmetical notions are involved. Further, we argue that bigger is not always better, and that chasing purely statistical improvements is flawed at its core, since it only exacerbates the dangerous conflation of producing correct answers with genuine reasoning ability.
https://arxiv.org/abs/2404.19432
The growing prominence of large language models (LLMs) necessitates the exploration of their capabilities beyond English. This research investigates the Telugu language proficiency of ChatGPT and Gemini, two leading LLMs. Through a designed set of 20 questions encompassing greetings, grammar, vocabulary, common phrases, task completion, and situational reasoning, the study delves into their strengths and weaknesses in handling Telugu. The analysis aims to identify the LLM that demonstrates a deeper understanding of Telugu grammatical structures, possesses a broader vocabulary, and exhibits superior performance in tasks like writing and reasoning. By comparing their ability to comprehend and use everyday Telugu expressions, the research sheds light on their suitability for real-world language interaction. Furthermore, the evaluation of adaptability and reasoning capabilities provides insights into how each LLM leverages Telugu to respond to dynamic situations. This comparative analysis contributes to the ongoing discussion on multilingual capabilities in AI and paves the way for future research in developing LLMs that can seamlessly integrate with Telugu-speaking communities.
https://arxiv.org/abs/2404.19369
Language models have been effective in a wide range of applications, yet the most sophisticated models are often proprietary. For example, GPT-4 by OpenAI and various models by Anthropic are expensive and consume substantial energy. In contrast, the open-source community has produced competitive models, like Llama3. Furthermore, niche-specific smaller language models, such as those tailored for legal, medical, or financial tasks, have outperformed their proprietary counterparts. This paper introduces a novel approach that employs \textit{functional tokens} to integrate \textbf{multiple open-source models}, each optimized for particular tasks. Our newly developed Octopus v4 model leverages \textit{functional tokens} to intelligently direct user queries to the most appropriate vertical model and reformat the query to achieve the best performance. Octopus v4, an evolution of the Octopus v1, v2, and v3 models, excels in selection, parameter understanding, and reformatting. Additionally, we explore the use of graphs as a versatile data structure that effectively coordinates multiple open-source models by harnessing the capabilities of the Octopus model and \textit{functional tokens}. Use our open-sourced GitHub (\url{this https URL}) to try the Octopus v4 models (\url{this https URL}), and contribute to a larger graph of language models. By activating models with fewer than 10B parameters, we achieved a SOTA MMLU score of 74.8 among models at the same level.
https://arxiv.org/abs/2404.19296
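The routing idea can be illustrated with a toy dispatcher: a (hypothetical) router emits a functional token naming a vertical model, and the query is reformatted and forwarded to it. The token names, model registry, and keyword-based router below are invented for illustration and are not Octopus v4's actual vocabulary or logic.

```python
# Hedged sketch: routing a query to a task-specific model via a functional token.
VERTICAL_MODELS = {
    "<law>": lambda q: f"[legal model answers: {q}]",
    "<med>": lambda q: f"[medical model answers: {q}]",
    "<fin>": lambda q: f"[finance model answers: {q}]",
}

def emit_functional_token(query: str) -> str:
    # Placeholder router: a real system would have a model predict this token.
    if "contract" in query.lower():
        return "<law>"
    if "symptom" in query.lower():
        return "<med>"
    return "<fin>"

def route(query: str) -> str:
    token = emit_functional_token(query)
    reformatted = f"Task for {token}: {query}"  # query reformatting step
    return VERTICAL_MODELS[token](reformatted)

print(route("Is this employment contract clause enforceable?"))
```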
In this paper, we establish a benchmark for table visual question answering, referred to as TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA. As such, the primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either through the application of a \textit{stylesheet} or by employing the proposed table rendering system. QA pairs are generated by exploiting a large language model (LLM), where the input is a text-formatted table. Ultimately, the completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. GPT-4V achieves the highest accuracy among the commercial and open-sourced MLLMs in our experiments. Moreover, we discover that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs in comparison to their LLM backbones, we investigate by presenting image-formatted tables to MLLMs and text-formatted tables to LLMs, respectively. Our findings suggest that processing visual inputs is more challenging than text inputs, as evidenced by the lower performance of MLLMs, despite their generally requiring higher computational costs than LLMs. The proposed TableVQA-Bench and evaluation codes are available at \href{this https URL}{this https URL}.
https://arxiv.org/abs/2404.19205
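A hedged sketch of the two data-construction steps described above: rendering a text-formatted table to styled HTML (to be rasterized into the image side of each example) and prompting an LLM to propose a QA pair from the text-formatted table. The stylesheet, prompt wording, and `call_llm` placeholder are all illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch: build TableVQA-style inputs from a text-formatted table.
TABLE = [["City", "Population"], ["Oslo", "709,000"], ["Bergen", "286,000"]]

STYLE = "table {border-collapse: collapse} td, th {border: 1px solid #333; padding: 4px}"

def to_html(rows: list[list[str]]) -> str:
    body = "".join("<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in rows)
    return f"<html><head><style>{STYLE}</style></head><body><table>{body}</table></body></html>"

def call_llm(prompt: str) -> str:
    return "Q: Which city has the larger population? A: Oslo"  # placeholder

html = to_html(TABLE)  # would be rendered to an image for the visual side
text_table = "\n".join("\t".join(row) for row in TABLE)
qa_pair = call_llm(f"Given this table, write one question-answer pair:\n{text_table}")
print(qa_pair)
```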
Despite the widespread application of knowledge graphs (KGs) in various tasks such as question answering and intelligent conversational systems, existing KGs face two major challenges: information granularity and deficiency in timeliness. These considerably hinder the retrieval and analysis of in-context, fine-grained, and up-to-date knowledge from KGs, particularly in highly specialized themes (e.g., specialized scientific research) and rapidly evolving contexts (e.g., breaking news or disaster tracking). To tackle such challenges, we propose a theme-specific knowledge graph (i.e., ThemeKG), a KG constructed from a theme-specific corpus, and design an unsupervised framework for ThemeKG construction (named TKGCon). The framework takes a raw theme-specific corpus and generates a high-quality KG that includes salient entities and relations under the theme. Specifically, we start with an entity ontology of the theme from Wikipedia, based on which we then generate candidate relations with Large Language Models (LLMs) to construct a relation ontology. To parse the documents from the theme corpus, we first map the extracted entity pairs to the ontology and retrieve the candidate relations. Finally, we incorporate the context and ontology to consolidate the relations for the entity pairs. We observe that directly prompting GPT-4 for a theme-specific KG leads to inaccurate entities (such as "two main types" as one entity in the query result) and unclear (such as "is", "has") or wrong relations (such as "have due to", "to start"). In contrast, by constructing the theme-specific KG step by step, our model outperforms GPT-4 and can consistently identify accurate entities and relations. Experimental results also show that our framework excels in evaluations compared with various KG construction baselines.
https://arxiv.org/abs/2404.19146
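The staged construction can be sketched as a small pipeline: start from a seed entity ontology, ask an LLM for candidate relations between ontology categories, then map extracted entity pairs onto those candidates and pick a relation. Everything below (ontology entries, the `propose_relations` placeholder, the example sentence) is invented for illustration of the flow, not TKGCon's actual components.

```python
# Hedged sketch of a theme-specific KG construction loop (simplified).
ENTITY_ONTOLOGY = {"graphene": "Material", "band gap": "Property"}

def propose_relations(category_a: str, category_b: str) -> list[str]:
    return ["has_property", "measured_for"]  # placeholder for an LLM call

def extract_entity_pairs(sentence: str) -> list[tuple[str, str]]:
    found = [e for e in ENTITY_ONTOLOGY if e in sentence.lower()]
    return [(a, b) for i, a in enumerate(found) for b in found[i + 1:]]

def build_triples(sentence: str) -> list[tuple[str, str, str]]:
    triples = []
    for head, tail in extract_entity_pairs(sentence):
        candidates = propose_relations(ENTITY_ONTOLOGY[head], ENTITY_ONTOLOGY[tail])
        relation = candidates[0]  # consolidation would use context + ontology here
        triples.append((head, relation, tail))
    return triples

print(build_triples("Graphene has a tunable band gap under strain."))
```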
Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt to improve the performance of the LLM in inferring the correct action and task plans. In this technical report, we extend the capabilities of HELPER by expanding its memory with a wider array of examples and prompts, and by integrating additional APIs for asking questions. This simple expansion of HELPER into a shared memory enables the agent to work across the domains of executing plans from dialogue, natural language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive visual-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks using a single agent, without requiring in-domain training, and remains competitive with agents that have undergone in-domain training.
https://arxiv.org/abs/2404.19065
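The memory-augmented planning trick amounts to retrieving the language-program examples most similar to a new instruction and placing them in the planner's prompt. Below is a minimal, hypothetical sketch using word-overlap similarity; the memory entries and program syntax are invented and are not HELPER's actual memory or API.

```python
# Hedged sketch: retrieve in-context language-program examples for an LLM planner.
MEMORY = [
    ("put the mug in the sink", "goto(mug); pickup(mug); goto(sink); put(mug, sink)"),
    ("turn on the lamp", "goto(lamp); toggle(lamp)"),
    ("throw away the apple", "goto(apple); pickup(apple); goto(trash); put(apple, trash)"),
]

def overlap(a: str, b: str) -> float:
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / max(len(words_a | words_b), 1)

def build_prompt(instruction: str, k: int = 2) -> str:
    examples = sorted(MEMORY, key=lambda m: overlap(instruction, m[0]), reverse=True)[:k]
    shots = "\n".join(f"Instruction: {i}\nProgram: {p}" for i, p in examples)
    return f"{shots}\nInstruction: {instruction}\nProgram:"

print(build_prompt("put the spoon in the sink"))
```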
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue. Additionally, we analyze the current challenges and limitations, formulating open questions that delineate potential pathways for future research. By drawing a granular classification and landscape of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. Through our thorough and in-depth review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners alike. Resources are available at: this https URL.
https://arxiv.org/abs/2404.18930
Remote Sensing Image Change Captioning (RSICC) aims to identify surface changes in multi-temporal remote sensing images and describe them in natural language. Current methods typically rely on an encoder-decoder architecture and focus on designing a sophisticated neck to process the bi-temporal features extracted by the backbone. Recently, State Space Models (SSMs), especially Mamba, have demonstrated outstanding performance in many fields, owing to their efficient feature-selective modelling capability. However, their potential in the RSICC task remains unexplored. In this paper, we introduce Mamba into RSICC and propose a novel approach called RSCaMa (Remote Sensing Change Captioning Mamba). Specifically, we utilize Siamese backbones to extract bi-temporal features, which are then processed through multiple CaMa layers consisting of a Spatial Difference-guided SSM (SD-SSM) and a Temporal Traveling SSM (TT-SSM). SD-SSM uses differential features to enhance change perception, while TT-SSM promotes bi-temporal interactions in a token-wise cross-scanning manner. Experimental results validate the effectiveness of the CaMa layers and demonstrate the superior performance of RSCaMa, as well as the potential of Mamba in the RSICC task. Additionally, we systematically compare the effects of three language decoders: Mamba, a GPT-style decoder with a causal attention mechanism, and a Transformer decoder with a cross-attention mechanism. This provides valuable insights for future RSICC research. The code will be available at this https URL.
https://arxiv.org/abs/2404.18895
As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's freeform generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT-4. While this method has grown in popularity, it is costly and has been shown to introduce intra-model bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLM evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
https://arxiv.org/abs/2404.18796
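A minimal sketch of the panel idea: several smaller judge models score the same output and their verdicts are pooled, here by simple averaging. The judge list, scores, and `judge_score` function are placeholders, not the paper's actual judges, prompts, or pooling rule.

```python
# Hedged sketch: Panel of LLM evaluators (PoLL) via score pooling.
from statistics import mean

JUDGES = ["judge-a-7b", "judge-b-8b", "judge-c-7b"]  # hypothetical small judge models

def judge_score(judge: str, question: str, answer: str) -> float:
    # Placeholder: a real implementation would prompt each judge model to rate
    # the answer (e.g. on a 1-5 scale) and parse the rating from its reply.
    return {"judge-a-7b": 4.0, "judge-b-8b": 5.0, "judge-c-7b": 4.0}[judge]

def poll_score(question: str, answer: str) -> float:
    return mean(judge_score(j, question, answer) for j in JUDGES)

print(poll_score("What is the capital of Norway?", "Oslo"))
```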
Explanatory inference is the creation and evaluation of hypotheses that provide explanations, and is sometimes known as abduction or abductive inference. Generative AI is a new set of artificial intelligence models based on novel algorithms for generating text, images, and sounds. This paper proposes a set of benchmarks for assessing the ability of AI programs to perform explanatory inference, and uses them to determine the extent to which ChatGPT, a leading generative AI model, is capable of making explanatory inferences. Tests on the benchmarks reveal that ChatGPT performs creative and evaluative inferences in many domains, although it is limited to verbal and visual modalities. Claims that ChatGPT and similar models are incapable of explanation, understanding, causal reasoning, meaning, and creativity are rebutted.
https://arxiv.org/abs/2404.18982
Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving, and reasoning. Existing benchmarks evaluate tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark derived from Advent of Code (AoC) challenges and Project Euler, including 2396 problems. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.
https://arxiv.org/abs/2404.18766
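Evaluating narrative-embedded coding problems ultimately reduces to executing the model's code against known answers. A hedged sketch of such a harness follows; `llm_solution` is a canned stand-in for a model's output, the `answer` variable convention is an assumption, and a real harness would need sandboxing and timeouts rather than a bare `exec`.

```python
# Hedged sketch: run LLM-generated code for a puzzle and compare to the expected answer.
def run_candidate(code: str, expected: str) -> bool:
    namespace: dict = {}
    try:
        exec(code, namespace)  # caution: a real harness must sandbox and time-limit this
        return str(namespace.get("answer")) == expected
    except Exception:
        return False

llm_solution = """
# Model-written solution to: "sum of all multiples of 3 or 5 below 10"
answer = sum(n for n in range(10) if n % 3 == 0 or n % 5 == 0)
"""

print(run_candidate(llm_solution, expected="23"))
```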
Recent research in dialogue systems and corpora has focused on two main categories: task-oriented (TOD) and open-domain (chit-chat) dialogues. TOD systems help users accomplish specific tasks, while open-domain systems aim to create engaging conversations. However, in real-world scenarios, user intents are often revealed during interactions. A recent study introduced SalesBot, which simulates dialogues transitioning from chit-chat to task-oriented scenarios to train sales agents. Unfortunately, the initial data lacked smooth transitions and coherent long-turn dialogues, resulting in poor naturalness in sales-customer interactions. To address these issues, this paper presents SalesBot 2.0, an improved dataset. It leverages commonsense knowledge from large language models (LLMs) through strategic prompting. Additionally, we introduce a novel model called SalesAgent, trained on salesperson interactions, using chain-of-thought (CoT) reasoning. This model excels in transitioning topics, understanding user intents, and selecting appropriate strategies. Experiments using diverse user simulations validate the effectiveness of our method in controlling dialogue strategies in LLMs. Furthermore, SalesBot 2.0 enhances coherence and reduces aggression, facilitating better model learning for sales-customer interactions.
https://arxiv.org/abs/2404.18564
Automated essay scoring (AES) to evaluate second language (L2) proficiency has been a firmly established technology used in educational contexts for decades. Although holistic scoring has seen advancements in AES that match or even exceed human performance, analytic scoring still encounters issues, as it inherits flaws and shortcomings from the human scoring process. The recent introduction of large language models presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency. In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores based on the Common European Framework of Reference, and aim to extract detailed information about the underlying analytic components. We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.
https://arxiv.org/abs/2404.18557
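The reported analysis boils down to correlating model-predicted analytic scores with proficiency-related text features. A small sketch with a hand-rolled Pearson correlation and made-up numbers (both score and feature values are illustrative, not the paper's data):

```python
# Hedged sketch: correlate predicted analytic scores with an essay-level feature.
import math

predicted_vocab_scores = [3.0, 4.5, 2.5, 5.0, 3.5]       # illustrative analytic scores
type_token_ratio = [0.42, 0.55, 0.38, 0.61, 0.47]         # illustrative lexical diversity

def pearson(x: list[float], y: list[float]) -> float:
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

print(f"r = {pearson(predicted_vocab_scores, type_token_ratio):.3f}")
```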
Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and the training datasets.
https://arxiv.org/abs/2404.18543
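The nonprognosticative property comes from filtering the pre-training corpus by document timestamp before training each point-in-time model. A minimal sketch of that filtering step (the dataset fields and documents are invented for illustration):

```python
# Hedged sketch: build a point-in-time training slice with no post-cutoff documents.
from datetime import date

CORPUS = [
    {"text": "Article about the 2018 World Cup.", "published": date(2018, 7, 15)},
    {"text": "Report on a 2021 language model release.", "published": date(2021, 6, 1)},
    {"text": "2019 climate summit coverage.", "published": date(2019, 12, 2)},
]

def point_in_time_slice(corpus: list[dict], cutoff: date) -> list[str]:
    """Keep only documents published on or before the cutoff date."""
    return [doc["text"] for doc in corpus if doc["published"] <= cutoff]

train_2019 = point_in_time_slice(CORPUS, date(2019, 12, 31))
print(len(train_2019), "documents available to the 2019 model")
```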
By training on text in various languages, large language models (LLMs) typically possess multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages: they struggle to keep their responses consistent when faced with the same task described in different languages. In this study, we first explore the consistency of LLMs' outputs in response to queries in various languages from two aspects: safety and quality. We conduct this analysis with two datasets (AdvBench and NQ) based on four LLMs (Llama2-13b, Gemma-7b, GPT-3.5-turbo and Gemini-pro). The results show that LLMs exhibit stronger human-alignment capabilities with queries in English, French, Russian, and Spanish (only 1.04\% of harmful queries successfully jailbreak on average) compared to queries in Bengali, Georgian, Nepali, and Maithili (27.7\% of harmful queries jailbreak successfully on average). Moreover, for queries in English, Danish, Czech, and Slovenian, LLMs tend to produce responses of higher quality (with a 0.1494 $F_1$ score on average) compared to the other languages. Upon these findings, we propose LDFighter, a similarity-based voting scheme, to mitigate linguistic discrimination in LLMs. LDFighter ensures consistent service for speakers of different languages. We evaluate LDFighter with both benign and harmful queries. The results show that LDFighter not only significantly reduces the jailbreak success rate but also improves response quality on average, demonstrating its effectiveness.
https://arxiv.org/abs/2404.18534
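The similarity-based voting idea can be sketched as choosing, among responses generated for several translations of the same query, the one closest on average to all the others (a medoid). The word-overlap similarity and the candidate responses below are stand-ins; LDFighter's actual similarity measure and selection rule may differ.

```python
# Hedged sketch: pick the medoid response across per-language generations.
CANDIDATES = {
    "en": "Paris is the capital of France.",
    "fr": "The capital of France is Paris.",
    "bn": "I cannot help with that request.",  # illustrative inconsistent response
}

def similarity(a: str, b: str) -> float:
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / max(len(words_a | words_b), 1)

def vote(candidates: dict[str, str]) -> str:
    def avg_similarity(response: str) -> float:
        others = [r for r in candidates.values() if r is not response]
        return sum(similarity(response, o) for o in others) / len(others)
    return max(candidates.values(), key=avg_similarity)

print(vote(CANDIDATES))
```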
Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, reveal that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform the others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen as the number of images increases. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
https://arxiv.org/abs/2404.18532
Generative large-scale language models create the fifth paradigm of scientific research: they organically combine data science and computational intelligence, transform the research paradigm of natural language processing and multimodal information processing, promote the new trend of AI-enabled social science research, and provide new ideas for digital humanities research and application. This article explores in depth the application of large-scale language models in digital humanities research, revealing their significant potential in ancient book preservation, intelligent processing, and academic innovation. The article first outlines the importance of ancient book resources and the necessity of digital preservation, followed by a detailed introduction to the development of large-scale language models, such as ChatGPT, and their applications in document management, content understanding, and cross-cultural research. Through specific cases, the article demonstrates how AI can assist in the organization, classification, and content generation of ancient books. It then explores the prospects of AI applications in artistic innovation and cultural heritage preservation. Finally, the article examines the challenges and opportunities in the interaction of technology, information, and society in the digital humanities triggered by AI technologies.
https://arxiv.org/abs/2404.18518