Paper Reading AI Learner

The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews

2024-04-24 05:53:20
Aleksi Huotala, Miikka Kuutila, Paul Ralph, Mika Mäntylä

Abstract

Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks, so automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate whether Large Language Models (LLMs) can accelerate title-abstract screening, both by simplifying abstracts for human screeners and by automating the screening itself. We performed an experiment in which humans screened the titles and abstracts of 20 papers from a prior SR, with both original and simplified abstracts. The experiment was then reproduced with the GPT-3.5 and GPT-4 LLMs performing the same screening tasks. We also studied whether different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied whether redesigning the prompt used in the LLM reproduction of the screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but it reduced the time spent on screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that GPT-4 is better than its predecessor, GPT-3.5, and that Few-shot and One-shot prompting outperform Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. More research is needed before LLMs can be recommended for the SR screening process. We recommend that future SR studies publish replication packages containing screening data to enable more conclusive experiments with LLM-based screening.
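The abstract compares Zero-shot, One-shot, and Few-shot prompting for LLM-based title-abstract screening. The Python sketch below illustrates how such prompts can be assembled with the OpenAI chat API; the inclusion criterion, prompt wording, and helper names are illustrative assumptions, not the prompts or code used in the study.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical inclusion criterion; a real SR would use its own protocol's criteria.
CRITERION = ("Include the paper if it reports an empirical software engineering "
             "study that is relevant to the review topic.")

def build_messages(title, abstract, examples=None):
    # With examples=None this is Zero-shot (ZS); one labelled example gives
    # One-shot (OS) and several give Few-shot (FS) prompting.
    messages = [{
        "role": "system",
        "content": "You screen papers for a systematic review. "
                   + CRITERION + " Answer only INCLUDE or EXCLUDE.",
    }]
    for ex_title, ex_abstract, ex_label in (examples or []):
        messages.append({"role": "user",
                         "content": f"Title: {ex_title}\nAbstract: {ex_abstract}"})
        messages.append({"role": "assistant", "content": ex_label})
    messages.append({"role": "user",
                     "content": f"Title: {title}\nAbstract: {abstract}"})
    return messages

def screen(title, abstract, examples=None, model="gpt-4"):
    # Ask the model for a screening decision on one candidate paper.
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep decisions as deterministic as possible
        messages=build_messages(title, abstract, examples),
    )
    return response.choices[0].message.content.strip()

# Zero-shot call; passing labelled (title, abstract, "INCLUDE"/"EXCLUDE") tuples
# via `examples` turns the same helper into One-shot or Few-shot prompting.
# decision = screen("Candidate paper title", "Candidate paper abstract ...")

A Few-shot with Chain-of-Thought (FS-CoT) variant would additionally place a short reasoning trace before the final INCLUDE/EXCLUDE label in each example's assistant turn.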

URL

https://arxiv.org/abs/2404.15667

PDF

https://arxiv.org/pdf/2404.15667.pdf

