Abstract
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication-efficiency requirements further complicate distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
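To make the "preprocessing from aggregated statistics" idea concrete, here is a minimal sketch (not FedPS's actual API; all function names are hypothetical) of horizontal-FL standard scaling: each client shares only per-feature counts, sums, and sums of squares, and the server combines these into global means and standard deviations without ever seeing raw rows.

```python
# Illustrative sketch only: federated standard scaling from aggregated
# statistics. Function names and the overall flow are assumptions for
# exposition, not the FedPS implementation.
import numpy as np

def local_stats(X):
    """Client side: return (row count, per-feature sum, per-feature sum of squares)."""
    X = np.asarray(X, dtype=float)
    return X.shape[0], X.sum(axis=0), (X ** 2).sum(axis=0)

def aggregate_scaler(client_stats):
    """Server side: merge client summaries into global mean and std."""
    n = sum(s[0] for s in client_stats)
    total = sum(s[1] for s in client_stats)
    total_sq = sum(s[2] for s in client_stats)
    mean = total / n
    var = total_sq / n - mean ** 2  # population variance from moments
    return mean, np.sqrt(var)

# Two clients hold disjoint rows of the same two features.
client_a = [[1.0, 10.0], [3.0, 30.0]]
client_b = [[5.0, 50.0]]
mean, std = aggregate_scaler([local_stats(client_a), local_stats(client_b)])
# Feature 0 globally takes the values [1, 3, 5], so its mean is 3.0.
```

The same pattern (local summary, server-side merge) extends to min/max scaling, quantile sketches for discretization, and category counts for encoding, which is why a single aggregation framework can cover the whole preprocessing pipeline.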
URL
https://arxiv.org/abs/2602.10870