Abstract
Peer review is at the heart of modern science. As submission numbers rise and research communities grow, a decline in review quality has become a popular narrative and a common concern. Yet, is it true? Review quality is difficult to measure, and the ongoing evolution of reviewing practices makes it hard to compare reviews across venues and over time. To address this, we introduce a new framework for the evidence-based comparative study of review quality and apply it to major AI and machine learning conferences: ICLR, NeurIPS, and *ACL. We document the diversity of review formats and introduce a new approach to review standardization. We propose a multi-dimensional schema that quantifies review quality as utility to editors and authors, coupled with both LLM-based and lightweight measurements. We study the relationships between these measurements and how review quality has evolved over time. Contradicting the popular narrative, our cross-temporal analysis reveals no consistent decline in median review quality across venues and years. We propose alternative explanations and outline recommendations to facilitate future empirical studies of review quality.
URL
https://arxiv.org/abs/2601.15172