Shuffled differential privacy protection method for K-Modes clustering data collection and publication

doi:10.11959/j.issn.1000-436x.2024004

Journal on Communications, 2024, Vol. 45, Issue (1): 201-213.

• Academic Communication •


Weijin JIANG¹,²,³, Yilin CHEN¹,³, Yuqing HAN¹,³, Yuting WU¹,³, Wei ZHOU¹,³, Haijuan WANG³,⁴

¹ School of Computer Science, Hunan University of Technology and Business, Changsha 410205, China
² School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China
³ Xiangjiang Laboratory, Changsha 410205, China
⁴ College of Frontier Intersection, Hunan University of Technology and Business, Changsha 410205, China

Revised: 2023-11-08  Online: 2024-01-01  Published: 2024-01-01
About the authors:
Weijin JIANG (1964- ), male, born in Yiyang, Hunan, Ph.D., is a professor and master's supervisor at Hunan University of Technology and Business. His main research interests are information security, network security, and crowd sensing.
Yilin CHEN (2000- ), female, born in Xuchang, Henan, is a master's student at Hunan University of Technology and Business. Her main research interests are information security and differential privacy.
Yuqing HAN (2000- ), male, born in Changsha, Hunan, is a master's student at Hunan University of Technology and Business. His main research interests are information security and federated learning.
Yuting WU (1998- ), female, born in Yiyang, Hunan, is a master's student at Hunan University of Technology and Business. Her main research interests are information security and crowd sensing.
Wei ZHOU (2000- ), male, born in Yiyang, Hunan, is a master's student at Hunan University of Technology and Business. His main research interests are information security and crowd sensing.
Haijuan WANG (2000- ), female, born in Jiujiang, Jiangxi, is a master's student at Hunan University of Technology and Business. Her main research interests are differential privacy and crowd sensing.
Supported by:
The National Natural Science Foundation of China (72088101); The National Natural Science Foundation of China (61772196); The Key Project of the Natural Science Foundation of Hunan Province (2020JJ4249); Key Laboratory of Hunan Province for New Retail Virtual Reality Technology (2017TP1026); Key Scientific Research Project of Hunan Provincial Department of Education (21A0374); Hunan Provincial Degree and Graduate Teaching Reform Project (2022JGYB194)

Abstract

To address the insufficient security of current clustering data collection and publication, and to protect user privacy in clustering data while improving data quality, a privacy protection method for K-Modes clustering data collection and publication that requires no trusted third party was proposed, based on the shuffled differential privacy model. First, a K-Modes clustering data collection algorithm was used to sample the user data and add noise, and a publishing algorithm that pads the value domain and applies a random permutation then scrambled the initial order of the sampled data, so that a malicious attacker could not identify a target user from the relationship between users and data. Next, with the interference of noise reduced as far as possible, new centroids were computed by cyclic iteration to complete the clustering. Finally, the privacy, feasibility, and complexity of the above three methods were analyzed theoretically, and the proposed method was compared on three real data sets with authoritative algorithms of the same type from recent years, including KM, DPLM, and LDPKM, in terms of accuracy and entropy, to verify its effectiveness. The experimental results show that the proposed method outperforms current similar algorithms in both privacy protection and the quality of the published data.

Key words: shuffled differential privacy, K-Modes clustering, privacy protection, data collection, data publication
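The collect, shuffle, and cluster pipeline outlined in the abstract can be illustrated with a minimal sketch. It assumes generalized randomized response (k-RR) as the local randomizer and plain K-Modes with simple-matching distance; the paper's own sampling, value-domain padding, and noise-reduction steps are not specified on this page and are not reproduced here. All function names (`k_rr`, `randomize_and_shuffle`, `k_modes`), the toy data, and the parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def k_rr(value, domain, epsilon, rng):
    """Generalized randomized response (an assumed randomizer): keep the true
    value with probability e^eps / (e^eps + |domain| - 1), otherwise report
    one of the other values uniformly at random."""
    d = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    if rng.random() < p_keep:
        return value
    return rng.choice([v for v in domain if v != value])


def randomize_and_shuffle(records, domains, epsilon, rng):
    """Each user perturbs every attribute locally, then a shuffler permutes
    the reports so that their order no longer points back to a user.
    Note: epsilon is applied per attribute; budget splitting is not handled."""
    reports = [
        tuple(k_rr(v, dom, epsilon, rng) for v, dom in zip(rec, domains))
        for rec in records
    ]
    rng.shuffle(reports)
    return reports


def hamming(a, b):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(data, k, rng, max_iter=20):
    """Plain K-Modes: assign each row to its nearest mode, then recompute
    each mode as the per-attribute most frequent value, until stable."""
    modes = rng.sample(data, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: hamming(row, modes[j]))
                  for row in data]
        new_modes = []
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if not members:  # keep the old mode for an empty cluster
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(
                Counter(col).most_common(1)[0][0] for col in zip(*members)
            ))
        if new_modes == modes:
            break
        modes = new_modes
    return modes, labels


# Toy run on synthetic categorical data with domain sizes 7 and 2.
rng = random.Random(0)
domains = [list(range(7)), list(range(2))]
records = [(rng.randrange(7), rng.randrange(2)) for _ in range(200)]
reports = randomize_and_shuffle(records, domains, epsilon=2.0, rng=rng)
modes, labels = k_modes(reports, k=3, rng=rng)
```

In this sketch the analyst only ever sees `reports`, which are both perturbed and reordered, and the clustering runs on those reports rather than on the raw records.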

CLC number: TP309

Cite this article:
Weijin JIANG, Yilin CHEN, Yuqing HAN, Yuting WU, Wei ZHOU, Haijuan WANG. Shuffled differential privacy protection method for K-Modes clustering data collection and publication[J]. Journal on Communications, 2024, 45(1): 201-213.

Figures/Tables (9): Table 1, Figures 1-8


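Table 1
Dataset	Number of users	Attribute	Domain size
Adult	30,000	Workclass	7
Adult	30,000	Education	16
Adult	30,000	Relationship	6
Adult	30,000	Sex	2
Adult	30,000	Race	5
IPUMS	30,000	School	3
IPUMS	30,000	Famsize	15
IPUMS	30,000	Sex	2
IPUMS	30,000	Race	8
Kosarak	30,625	User	6
Kosarak	30,625	Item	10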
