Abstract
Since late 2022, generative AI has taken the world by storm, with widespread use of tools including ChatGPT, Gemini, and Claude. Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge. However, the intricate relationship between open data and generative AI, and the vast potential this relationship holds for driving innovation in the field, remain underexplored. This white paper seeks to unpack the relationship between open data and generative AI and explore possible components of a new Fourth Wave of Open Data: Is open data becoming AI ready? Is open data moving towards a data commons approach? Is generative AI making open data more conversational? Will generative AI improve open data quality and provenance? Towards this end, we provide a new Spectrum of Scenarios framework. This framework outlines a range of scenarios in which open data and generative AI could intersect and what is required from a data quality and provenance perspective to make open data ready for those specific scenarios. These scenarios include: pretraining, adaptation, inference and insight generation, data augmentation, and open-ended exploration. Through this process, we found that in order for data holders to embrace generative AI to improve open data access and develop greater insights from open data, they must first make progress in five key areas: enhance transparency and documentation, uphold quality and integrity, promote interoperability and standards, improve accessibility and usability, and address ethical considerations.
Abstract (translated)
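Since late 2022, generative AI has upended the world, and the widespread use of tools including ChatGPT, Gemini, and Claude has enabled people to find and access data and knowledge in entirely new ways. Generative AI and large language model (LLM) applications are changing how individuals discover and obtain data and knowledge. However, the relationship between open data and generative AI, and the broad potential it holds for driving innovation in this field, remain unexplored. This white paper aims to unpack the relationship between open data and generative AI and to explore possible new components of a Fourth Wave of Open Data: Is open data becoming AI ready? Is open data evolving towards a data commons approach? Is generative AI making open data more conversational? Is generative AI improving the quality and provenance of open data? To this end, we provide a new scenario framework. This framework outlines how open data and generative AI may intersect in different scenarios, and what is required, from a data quality and provenance perspective, for open data to be ready for those scenarios. These scenarios include: pretraining, adaptation, inference and insight generation, data augmentation, and open-ended exploration. Through this process, we found that for data holders to use generative AI to improve open data access and gain greater insights from open data, they must first make progress in five key areas: enhancing transparency and documentation, upholding quality and integrity, promoting interoperability and standards, improving accessibility and usability, and addressing ethical issues.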
URL
https://arxiv.org/abs/2405.04333