Abstract
Measuring public attitudes toward wildlife provides crucial insights into our relationship with nature and helps monitor progress toward Global Biodiversity Framework targets. Yet, conducting such assessments at a global scale is challenging. Manually curating search terms for querying news and social media is tedious, costly, and can lead to biased results. Raw news and social media data returned from queries are often cluttered with irrelevant content and syndicated articles. We aim to overcome these challenges by leveraging modern Natural Language Processing (NLP) tools. We introduce a folk taxonomy approach for improved search term generation and employ cosine similarity on Term Frequency-Inverse Document Frequency vectors to filter syndicated articles. We also introduce an extensible relevance filtering pipeline which uses unsupervised learning to reveal common topics, followed by an open-source zero-shot Large Language Model (LLM) to assign topics to news article titles, which are then used to assign relevance. Finally, we conduct sentiment, topic, and volume analyses on resulting data. We illustrate our methodology with a case study of news and X (formerly Twitter) data before and during the COVID-19 pandemic for various mammal taxa, including bats, pangolins, elephants, and gorillas. During the data collection period, up to 62% of articles including keywords pertaining to bats were deemed irrelevant to biodiversity, underscoring the importance of relevance filtering. At the pandemic's onset, we observed increased volume and a significant sentiment shift toward horseshoe bats, which were implicated in the pandemic, but not for other focal taxa. The proposed methods open the door to conservation practitioners applying modern and emerging NLP tools, including LLMs "out of the box," to analyze public perceptions of biodiversity during current events or campaigns.
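As an illustration of the syndicated-article filter described above, the following is a minimal sketch of near-duplicate detection using cosine similarity on TF-IDF vectors. It is not the authors' implementation: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `drop_syndicated`), the simple regex tokenizer, the smoothed IDF formula, and the 0.8 similarity threshold are all assumptions chosen for the example, and the sample headlines are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): terms found in every document
        # still receive a nonzero weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def drop_syndicated(docs, threshold=0.8):
    """Return indices of documents to keep.

    A document is kept only if its similarity to every previously kept
    document is below the threshold (a hypothetical value, to be tuned).
    """
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Invented headlines: the second is a syndicated copy of the first.
articles = [
    "Horseshoe bats linked to coronavirus origins, new study of cave colonies finds",
    "Horseshoe bats linked to coronavirus origins, new study of cave colonies finds (Newswire)",
    "Elephant population rebounds in Kenyan national park",
]
print(drop_syndicated(articles))  # prints [0, 2]
```

The greedy pass keeps the first article in each near-duplicate group and compares every new article against the kept ones only. This is quadratic in the worst case, so a large corpus would likely need a vectorized or approximate approach.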
URL
https://arxiv.org/abs/2405.01610