Journal on Communications (通信学报) ›› 2023, Vol. 44 ›› Issue (1): 164-176. doi: 10.11959/j.issn.1000-436x.2023007

• Academic Paper •

Data augmentation scheme for federated learning with non-IID data

Lingtao TANG1, Di WANG1, Shengyun LIU2

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China
  2. School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Revised: 2022-11-16  Online: 2023-01-25  Published: 2023-01-01
  • About the authors:
    Lingtao TANG (1994- ), male, born in Qidong, Jiangsu, China, is a Ph.D. candidate at the State Key Laboratory of Mathematical Engineering and Advanced Computing. His main research interests include information security, machine learning, and privacy preservation.
    Di WANG (1993- ), female, born in Xuzhou, Jiangsu, China, is a master's student at the State Key Laboratory of Mathematical Engineering and Advanced Computing. Her main research interest is artificial intelligence chip design.
    Shengyun LIU (1985- ), male, born in Kunming, Yunnan, China, holds a Ph.D. and is an assistant professor at Shanghai Jiao Tong University. His main research interests include blockchain, federated learning, and distributed storage.
  • Supported by:
    The National Key Research and Development Program of China (2016YFB1000500); The National Science and Technology Major Project (2018ZX01028102)

Abstract:

To solve the problem that the model accuracy remains low when data are not independent and identically distributed (non-IID) across clients in federated learning, a privacy-preserving data augmentation scheme was proposed. Firstly, a data augmentation framework for federated learning scenarios was designed, in which all clients generated synthetic samples locally and shared them with each other, which alleviated the client-drift problem caused by differences in the clients' data distributions. Secondly, based on generative adversarial networks and differential privacy, a private sample generation algorithm was proposed, which helped clients generate informative samples while preserving the privacy of their local data. Finally, a differentially private label selection algorithm was proposed to ensure that the labels of the synthetic samples do not leak information. Simulation results demonstrate that under multiple non-IID data partition strategies, the proposed scheme consistently improves model accuracy and makes the model converge faster. Compared with the benchmark approaches, the proposed scheme achieves at least a 25% accuracy improvement in the extreme non-IID case where each client holds samples from only one class.
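
The abstract names the ingredients of the sample generation step (a generative adversarial network trained under differential privacy) but not its details. Purely as an illustrative sketch, and not the paper's actual algorithm, the following PyTorch fragment shows one differentially private discriminator update in the DP-SGD style (per-sample gradient clipping plus Gaussian noise). The discriminator D, generator G, latent_dim, and all hyperparameters are assumed names for illustration; the generator update is omitted, and the cumulative (epsilon, delta) budget would still have to be tracked with a privacy accountant.

import torch
import torch.nn as nn

def dp_discriminator_step(D, G, real_batch, opt_D, latent_dim,
                          clip_norm=1.0, noise_multiplier=1.1):
    # One discriminator update in the DP-SGD style: process the batch one
    # sample at a time, clip each per-sample gradient to clip_norm, sum the
    # clipped gradients, add Gaussian noise scaled to the clipping bound,
    # then take an ordinary optimizer step on the averaged noisy gradient.
    bce = nn.BCEWithLogitsLoss()
    params = list(D.parameters())
    summed = [torch.zeros_like(p) for p in params]

    for x in real_batch:                       # micro-batching over samples
        x = x.unsqueeze(0)
        fake = G(torch.randn(1, latent_dim)).detach()
        real_logit, fake_logit = D(x), D(fake)
        loss = bce(real_logit, torch.ones_like(real_logit)) + \
               bce(fake_logit, torch.zeros_like(fake_logit))
        grads = torch.autograd.grad(loss, params)
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (grad_norm + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    opt_D.zero_grad()
    n = len(real_batch)
    for p, s in zip(params, summed):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=s.shape)
        p.grad = (s + noise) / n
    opt_D.step()

Because the generator is updated only through the privatized discriminator, samples drawn from it inherit the same differential privacy guarantee by post-processing, which is the usual argument for why GAN-generated samples can be shared across clients.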

Key words: federated learning, non-IID, generative adversarial network, differential privacy, data augmentation
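
The abstract also does not specify how a label is attached to a synthetic sample under differential privacy. As one standard possibility, again a sketch under assumptions rather than the authors' label selection algorithm, a client could collect per-class votes for a synthetic sample (for example from the labels of its nearest local training examples) and release only a noisy argmax:

import numpy as np

def dp_label_selection(class_votes, epsilon, rng=None):
    # class_votes: per-class vote counts for one synthetic sample, e.g. the
    # labels of its nearest neighbours in the client's local data.
    # Changing one local record moves at most one vote, so the vote vector
    # has L1 sensitivity 2; Laplace noise of scale 2/epsilon on every count
    # makes the whole noisy vector epsilon-DP, and taking the argmax is
    # just post-processing.
    rng = np.random.default_rng() if rng is None else rng
    votes = np.asarray(class_votes, dtype=float)
    noisy = votes + rng.laplace(loc=0.0, scale=2.0 / epsilon, size=votes.shape)
    return int(np.argmax(noisy))

# hypothetical usage: 10 classes, most neighbour votes go to class 3
label = dp_label_selection([0, 1, 0, 7, 0, 1, 0, 1, 0, 0], epsilon=1.0)

The Laplace scale 2/epsilon is a conservative choice for vote vectors whose L1 norm changes by at most 2 when one training record changes; each released label then consumes epsilon from the client's privacy budget.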

CLC number:
