Abstract
In high dimensions, most machine learning methods are brittle to even a small fraction of structured outliers. To address this, we introduce a new meta-algorithm that can take in a base learner such as least squares or stochastic gradient descent, and harden the learner to be resistant to outliers. Our method, Sever, possesses strong theoretical guarantees yet is also highly scalable -- beyond running the base learner itself, it only requires computing the top singular vector of a certain $n \times d$ matrix. We apply Sever to a drug design dataset and a spam classification dataset, and find that in both cases it is substantially more robust than several baselines. On the spam dataset, with $1\%$ corruptions, we achieved $7.4\%$ test error, compared to $13.4\%$--$20.5\%$ for the baselines and $3\%$ error on the uncorrupted dataset. Similarly, on the drug design dataset, with $10\%$ corruptions, we achieved a test mean-squared error of $1.42$, compared to $1.51$--$2.33$ for the baselines and $1.23$ on the uncorrupted dataset.
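The filtering idea the abstract alludes to can be sketched as follows. This is a hedched illustrative sketch, not the authors' implementation: the function name `sever_filter_step`, the removal fraction, and the toy data are all assumptions. The sketch centers the $n \times d$ matrix of per-sample gradients, projects onto its top right singular vector, and removes the points whose squared projections (outlier scores) are largest.

```python
import numpy as np

def sever_filter_step(gradients, remove_frac=0.05):
    """One outlier-filtering step in the spirit of Sever (illustrative sketch).

    gradients: n x d array of per-sample gradients at the current model.
    Returns the sorted indices of the points to keep.
    """
    # Center the gradient matrix so the top singular direction
    # captures the direction of largest variance.
    G = gradients - gradients.mean(axis=0)
    # Top right singular vector of the centered n x d matrix.
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    v = Vt[0]
    # Outlier score: squared projection onto the top direction.
    scores = (G @ v) ** 2
    # Drop the highest-scoring fraction of points.
    n_remove = max(1, int(remove_frac * len(G)))
    keep = np.argsort(scores)[: len(G) - n_remove]
    return np.sort(keep)

# Toy demo: 95 inliers with small gradients, 5 outliers shifted in one direction.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 0.1, size=(95, 3))
outliers = rng.normal(0.0, 0.1, size=(5, 3)) + np.array([5.0, 0.0, 0.0])
grads = np.vstack([inliers, outliers])
kept = sever_filter_step(grads, remove_frac=0.05)
```

In the full meta-algorithm this step would alternate with re-running the base learner on the kept points until no high-scoring outliers remain; the sketch shows only a single filtering pass.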
URL
https://arxiv.org/abs/1803.02815