Abstract
Open science initiatives have strengthened scientific integrity and accelerated research progress across many fields, but the state of their practice within transportation research remains under-investigated. Key features of open science, defined here as data and code availability, are difficult to extract due to the inherent complexity of the field. Previous work has either been limited to small-scale studies due to the labor-intensive nature of manual analysis or has relied on large-scale bibliometric approaches that sacrifice contextual richness. This paper introduces an automatic and scalable feature-extraction pipeline to measure data and code availability in transportation research. We employ Large Language Models (LLMs) for this task and validate their performance against a manually curated dataset and through an inter-rater agreement analysis. We applied this pipeline to 10,724 research articles published in the Transportation Research Part series of journals between 2019 and 2024. Our analysis found that only 5% of quantitative papers shared a code repository, 4% of quantitative papers shared a data repository, and about 3% of papers shared both, with trends differing across journals, topics, and geographic regions. We found no significant difference in citation counts or review duration between papers that provided data and code and those that did not, suggesting a misalignment between open science efforts and traditional academic metrics. Consequently, encouraging these practices will likely require structural interventions from journals and funding agencies to compensate for the lack of direct author incentives. The pipeline developed in this study can be readily scaled to other journals, representing a critical step toward the automated measurement and monitoring of open science practices in transportation research.
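As an illustration of the validation step, the sketch below computes an agreement score between manual labels and LLM labels, along with the share of papers flagged as sharing code. The abstract does not say which agreement statistic the authors used; Cohen's kappa is assumed here, and the function name and sample labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels: True means "paper shares a code repository".
manual = [True, False, False, True, False, False, False, True]
llm = [True, False, False, False, False, False, False, True]

kappa = cohens_kappa(manual, llm)
share = sum(llm) / len(llm)  # share of papers the LLM flags as sharing code
print(f"kappa={kappa:.3f}, share={share:.1%}")
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one label dominates, as it does when only a few percent of papers share code or data.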
URL
https://arxiv.org/abs/2601.14429