Abstract
The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the ICLR dataset, consisting of the abstracts of all 24 thousand ICLR submissions from 2017 to 2024, together with metadata, decision scores, and custom keyword-based labels. We find that on this dataset, a bag-of-words representation outperforms most dedicated sentence transformer models in terms of $k$NN classification accuracy, and the top-performing language models barely outperform TF-IDF. We see this as a challenge for the NLP community. Furthermore, we use the ICLR dataset to study how the field of machine learning has changed over the last seven years, finding some improvement in gender balance. Using a 2D embedding of the abstracts' texts, we describe a shift in research topics from 2017 to 2024 and identify hedgehogs and foxes among the authors with the highest number of ICLR submissions.
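To make the evaluation metric concrete, here is a minimal sketch of $k$NN classification accuracy on TF-IDF vectors. It is not the authors' pipeline: the toy abstracts, the labels, the smoothed IDF variant, cosine similarity, leave-one-out evaluation, and the function names `tfidf_vectors`, `cosine`, and `knn_loo_accuracy` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (as dicts) for tokenised documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; one common variant, assumed here for illustration.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_loo_accuracy(vectors, labels, k=3):
    """Leave-one-out kNN accuracy using a majority vote over the k nearest neighbours."""
    correct = 0
    for i, query in enumerate(vectors):
        sims = [(cosine(query, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(reverse=True)
        neighbour_labels = [labels[j] for _, j in sims[:k]]
        predicted = Counter(neighbour_labels).most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(vectors)


# Hypothetical toy "abstracts" with keyword-style labels (not from the ICLR dataset).
texts = [
    "policy gradient reinforcement learning agent reward",
    "reinforcement learning agent exploration reward",
    "agent policy reward exploration",
    "graph neural network node classification",
    "graph network message passing node",
    "node graph embedding message passing",
]
labels = ["rl", "rl", "rl", "graph", "graph", "graph"]

vectors = tfidf_vectors([t.split() for t in texts])
accuracy = knn_loo_accuracy(vectors, labels, k=3)
print(f"kNN accuracy (k=3): {accuracy:.2f}")
```

Swapping `tfidf_vectors` for any function that maps abstracts to vectors (for example, sentence-transformer embeddings) would let the same accuracy measure compare representations, which is the kind of comparison the abstract reports.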
URL
https://arxiv.org/abs/2404.08403