Abstract
State-of-the-art automated machine learning systems for tabular data often employ cross-validation to ensure that measured performances generalize to unseen data and that subsequent ensembling does not overfit. However, using k-fold cross-validation instead of holdout validation drastically increases the computational cost of validating a single configuration. While cross-validation yields better generalization and, by extension, better performance, the additional cost is often prohibitive for effective model selection within a time budget. We aim to make model selection with cross-validation more effective; therefore, we study early stopping of the cross-validation process during model selection. We investigate the impact of early stopping on random search for two algorithms, MLP and random forest, across 36 classification datasets. We further analyze the impact of the number of folds by considering 3, 5, and 10 folds. In addition, we investigate the impact of early stopping with Bayesian optimization instead of random search, as well as with repeated cross-validation. Our exploratory study shows that even a simple-to-understand and easy-to-implement method consistently allows model selection to converge faster, in ~94% of all datasets, on average by ~214%. Moreover, stopping cross-validation enables model selection to explore the search space more exhaustively, considering +167% configurations on average within one hour, while also obtaining better overall performance.
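The idea of early-stopping cross-validation during model selection can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's exact method: a configuration's k-fold evaluation is aborted as soon as its mean score over the folds seen so far falls below the incumbent's full cross-validation mean. The function names and the stopping rule here are assumptions for the sketch.

```python
# Hedged sketch: abort a configuration's k-fold cross-validation early when
# its running mean score already trails the best fully-validated configuration.
# The stopping rule is an illustrative assumption, not the paper's exact method.

def early_stopped_cv(fold_scores, best_mean_so_far):
    """Consume per-fold validation scores one by one (in practice, each score
    comes from fitting the model on one split). Return the mean over the folds
    actually evaluated and how many folds were used."""
    seen = []
    for score in fold_scores:
        seen.append(score)
        running_mean = sum(seen) / len(seen)
        # Early stop: this configuration can no longer look competitive.
        if running_mean < best_mean_so_far:
            return running_mean, len(seen)
    return running_mean, len(seen)

def model_selection(candidates):
    """Random-search-style loop over (name, fold_scores) candidates: keep the
    best configuration, skipping remaining folds for ones that fall behind."""
    best_name, best_mean = None, float("-inf")
    for name, fold_scores in candidates:
        mean, used = early_stopped_cv(fold_scores, best_mean)
        # Only fully cross-validated configurations may become the incumbent.
        if used == len(fold_scores) and mean > best_mean:
            best_name, best_mean = name, mean
    return best_name, best_mean
```

Configurations that fall behind early consume only one or a few folds instead of all k, which is the source of the speedup the abstract reports: the saved budget lets the search evaluate more configurations in the same wall-clock time.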
URL
https://arxiv.org/abs/2405.03389