FedPS: Federated data Preprocessing via aggregated Statistics

2026-02-11 13:58:55
Xuefeng Xu, Graham Cormode

Abstract

Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
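
As a rough illustration of preprocessing from aggregated statistics, the sketch below standardizes one feature in a horizontal FL setting. It is not taken from the paper: the abstract does not give FedPS's algorithms or API, so the names `LocalStats`, `local_stats`, `aggregate`, and `standardize` are hypothetical, and the choice of count, sum, and sum of squares as the shared statistics is an assumption. Each client sends only those three numbers, and the aggregator recovers the same mean and standard deviation that pooling the raw data would give.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class LocalStats:
    """Hypothetical per-client summary of one numeric feature."""
    count: int
    total: float
    total_sq: float


def local_stats(values):
    # Runs on each client; only these three numbers are shared, not the rows.
    return LocalStats(len(values), sum(values), sum(v * v for v in values))


def aggregate(stats):
    # Runs on the aggregator: combine client summaries into global mean and std.
    n = sum(s.count for s in stats)
    mean = sum(s.total for s in stats) / n
    # Population variance via E[x^2] - E[x]^2, clamped against rounding error.
    var = max(sum(s.total_sq for s in stats) / n - mean * mean, 0.0)
    return mean, sqrt(var)


def standardize(values, mean, std):
    # Runs on each client with the broadcast global parameters.
    scale = std if std > 0 else 1.0  # leave constant features unscaled
    return [(v - mean) / scale for v in values]


# Two clients holding disjoint rows of the same feature.
clients = [[1.0, 2.0, 3.0], [4.0, 5.0]]
mean, std = aggregate([local_stats(c) for c in clients])
scaled = [standardize(c, mean, std) for c in clients]
```

Because every client applies the same global mean and standard deviation, the scaled feature is consistent across parties. The sum-of-squares formula is used here only for brevity; it can lose precision when values are large relative to their spread, and a production system would likely prefer a numerically stable merge of per-client means and variances.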

URL

https://arxiv.org/abs/2602.10870

PDF

https://arxiv.org/pdf/2602.10870.pdf

