Abstract
This paper addresses water quality analysis, an important societal challenge. Because water supply is a key factor in economic and social development, providing water and ensuring its quality have long been top priorities for public authorities. Water networks are currently monitored and assessed through methods such as offline and online surveys. However, these surveys have several limitations, including a limited number of participants and a low frequency, owing to the labor involved in conducting them. In this paper, we propose a Natural Language Processing (NLP) framework that automatically collects and analyzes water-related posts from social media to support data-driven decisions. The framework has two components: (i) text classification and (ii) topic modeling. For text classification, we propose a merit-fusion-based framework that combines several Large Language Models (LLMs), with different weight selection and optimization methods used to assign weights to the LLMs. For topic modeling, we employ the BERTopic library to discover hidden topic patterns in water-related tweets. We also analyze relevant tweets originating from different regions and countries to explore global, regional, and country-specific issues and water-related concerns. Finally, we collected and manually annotated a large-scale dataset, which is expected to facilitate future research on the topic.
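As a rough illustration of the merit-based fusion idea mentioned above, the sketch below combines the class probabilities of several classifiers using weights proportional to each model's validation score. The abstract does not specify the weighting rule, so the proportional scheme, the function names (`merit_weights`, `fuse`, `predict`), and all numbers are assumptions made for illustration only, not the method used in the paper.

```python
from typing import List, Sequence


def merit_weights(val_scores: Sequence[float]) -> List[float]:
    """Turn per-model validation scores into weights that sum to 1.

    Assumption: weights are simply proportional to the scores. The paper
    mentions several weight selection and optimization methods; this is
    only one plausible, simple choice.
    """
    total = sum(val_scores)
    if total <= 0:
        raise ValueError("validation scores must sum to a positive value")
    return [s / total for s in val_scores]


def fuse(probabilities: Sequence[Sequence[Sequence[float]]],
         weights: Sequence[float]) -> List[List[float]]:
    """Weighted late fusion: probabilities[m][i][c] is the probability that
    model m assigns to class c for sample i."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    n_samples = len(probabilities[0])
    n_classes = len(probabilities[0][0])
    fused = [[0.0] * n_classes for _ in range(n_samples)]
    for model_probs, w in zip(probabilities, weights):
        for i, row in enumerate(model_probs):
            for c, p in enumerate(row):
                fused[i][c] += w * p
    return fused


def predict(fused: Sequence[Sequence[float]]) -> List[int]:
    """Pick the class with the highest fused score for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Hypothetical example: three models, two tweets, two classes
# (0 = not water-related, 1 = water-related). All values are made up.
val_scores = [0.90, 0.80, 0.70]
probabilities = [
    [[0.2, 0.8], [0.6, 0.4]],  # model A
    [[0.4, 0.6], [0.3, 0.7]],  # model B
    [[0.7, 0.3], [0.2, 0.8]],  # model C
]
weights = merit_weights(val_scores)
fused = fuse(probabilities, weights)
labels = predict(fused)
print(weights, fused, labels)
```

In this toy setup the second tweet is labeled water-related even though the highest-scoring model leans the other way, because the two weaker models together outweigh it. An optimization-based weighting scheme could be swapped in by replacing `merit_weights` alone.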
URL
https://arxiv.org/abs/2404.14977