Parallel division clustering algorithm based on Spark framework and ASPSO

Journal on Communications, 2022, Vol. 43, Issue (3): 148-163. doi: 10.11959/j.issn.1000-436x.2022054

Yimin MAO¹, Dejin GAN¹, Liefa LIAO¹, Zhigang CHEN²

¹ School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
² College of Computer Science and Engineering, Central South University, Changsha 410083, China

Revised: 2021-12-10 Online: 2022-03-25 Published: 2022-03-01
About the authors:
Yimin MAO (毛伊敏, 1970- ), female, born in Yili, Xinjiang. Ph.D., professor and doctoral supervisor at Jiangxi University of Science and Technology. Her main research interests are data mining and big data security and privacy protection.
Dejin GAN (甘德瑾, 1997- ), male, born in Fuzhou, Jiangxi. Master's student at Jiangxi University of Science and Technology. His main research interests are data mining and big data.
Liefa LIAO (廖列法, 1975- ), male, born in Yushan, Jiangxi. Ph.D., professor and master's supervisor at Jiangxi University of Science and Technology. His main research interests include artificial intelligence.
Zhigang CHEN (陈志刚, 1964- ), male, born in Yiyang, Hunan. Ph.D., professor and doctoral supervisor at Central South University. His main research interests are network and distributed computing and opportunistic networks.
Supported by:
The National Natural Science Foundation of China (41562019); Technological Innovation 2030 "Next-Generation Artificial Intelligence" Major Project (2020AAA0109605)

Abstract:

Partition clustering algorithms that process massive data suffer from a large data dispersion coefficient and poor anti-interference ability, difficulty in determining the number of local clusters, randomness of local cluster centroids, and low efficiency in the parallel merging of local clusters. To address these problems, a parallel partition clustering algorithm based on the Spark framework and an adaptive strategy of particle swarm optimization (ASPSO), named PDC-SFASPSO, was proposed. Firstly, a grid division strategy based on the Pearson correlation coefficient and variance was proposed to obtain grid cells with a small data dispersion coefficient and to filter outliers, which reduced the data dispersion coefficient and improved anti-interference ability. Secondly, a grid division strategy based on the potential function and the Gaussian function was proposed, which formed areas with different sample points as the cores of clusters and obtained the number of local clusters. Then, ASPSO was proposed to obtain the local cluster centroids, which avoided the randomness of local cluster centroids. Finally, a merging strategy based on cluster radius and neighbor nodes was proposed to merge highly similar clusters in parallel on the Spark computing framework, which improved the efficiency of parallel merging of local clusters. Experimental results showed that the PDC-SFASPSO algorithm performed well in partition clustering of data in a big data environment and was suitable for parallel clustering of large-scale data sets.

Key words: Spark framework, parallel division clustering, grid division, ASPSO, parallel merge
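
The first step in the abstract relies on three standard statistics: the Pearson correlation coefficient, the variance, and the data dispersion coefficient. The sketch below only illustrates these quantities on the points of one hypothetical grid cell. It assumes that the dispersion coefficient is the usual coefficient of variation (standard deviation divided by the absolute mean); it does not reproduce the paper's grid division or outlier filtering rule, which this page does not specify. All function names and sample values are illustrative.

```python
import math


def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance of a non-empty list of numbers."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def dispersion_coefficient(xs):
    """Coefficient of variation: std / |mean| (assumed definition)."""
    return math.sqrt(variance(xs)) / abs(mean(xs))


# Hypothetical values of two features for the points inside one grid cell.
feature_a = [2.0, 4.0, 6.0, 8.0]
feature_b = [1.0, 2.0, 3.0, 4.0]

r = pearson(feature_a, feature_b)        # the two features are perfectly linear
cv = dispersion_coefficient(feature_a)   # relative spread of feature_a
```

A cell whose features are strongly correlated and whose dispersion coefficient is small is, informally, a compact cell; how the paper turns these statistics into a division rule is not shown here.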

CLC number: TP311

Yimin MAO, Dejin GAN, Liefa LIAO, Zhigang CHEN. Parallel division clustering algorithm based on Spark framework and ASPSO[J]. Journal on Communications, 2022, 43(3): 148-163.

Figures/Tables (12)

Table 3 Benchmark test functions

Type	Function	Expression	Range	Optimum
Unimodal	Sphere	$f_{1}(x) = \sum_{i=1}^{d} x_{i}^{2}$	[-10,10]	0
Unimodal	Schwefel	$f_{2}(x) = \sum_{i=1}^{d} |x_{i}| + \prod_{i=1}^{d} |x_{i}|$	[-100,100]	0
High-dimensional multimodal	Ackley	$f_{3}(x) = 20 + e - 20 \exp\left(-0.2 \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_{i}^{2}}\right) - \exp\left(\frac{1}{d} \sum_{i=1}^{d} \cos(2 \pi x_{i})\right)$	[-32,32]	0
High-dimensional multimodal	Griewank	$f_{4}(x) = 1 + \sum_{i=1}^{d} \frac{x_{i}^{2}}{4000} - \prod_{i=1}^{d} \cos\left(\frac{x_{i}}{\sqrt{i}}\right)$	[-600,600]	0
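
The four benchmark functions in the table can be written directly from their expressions. The following is a minimal sketch in plain Python; the function names are chosen here for illustration, and the second function is the form commonly known as Schwefel's problem 2.22. Each function takes a list of `d` coordinates and has its minimum value 0 at the origin.

```python
import math


def sphere(x):
    """f1: sum of squared coordinates."""
    return sum(xi ** 2 for xi in x)


def schwefel(x):
    """f2: sum of absolute values plus product of absolute values."""
    abs_x = [abs(xi) for xi in x]
    return sum(abs_x) + math.prod(abs_x)


def ackley(x):
    """f3: Ackley function with the usual constants 20 and 0.2."""
    d = len(x)
    mean_sq = sum(xi ** 2 for xi in x) / d
    mean_cos = sum(math.cos(2 * math.pi * xi) for xi in x) / d
    return 20 + math.e - 20 * math.exp(-0.2 * math.sqrt(mean_sq)) - math.exp(mean_cos)


def griewank(x):
    """f4: Griewank function; the index i in sqrt(i) starts at 1."""
    quad = sum(xi ** 2 for xi in x) / 4000
    prod = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1))
    return 1 + quad - prod


origin = [0.0] * 10
values_at_origin = [f(origin) for f in (sphere, schwefel, ackley, griewank)]
```

Evaluating all four at the origin is a quick sanity check that the implementation matches the "Optimum" column.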



Node type	Hostname	IP address
master	master	192.168.111.1
worker	slave_1	192.168.111.2
worker	slave_2	192.168.111.3
worker	slave_3	192.168.111.4

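Data set	Samples	Features	File size (MB)	Characteristics
Online Retail	1 067 371	8	580	Many samples, few attributes
N_BaIoT	7 062 606	115	960.5	Many samples, moderate number of attributes
Health News	580 000	25 000	830.2	Few samples, many attributes
Bag words	8 000 000	1 000 000	2 687.9	Many samples, many attributes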

Function	Algorithm	Median	Mean	Standard deviation
f₁	PSO	6.487×10^-14	2.189×10^-13	1.584×10^-14
f₁	MSPSO	8.432×10^-16	1.345×10^-13	5.365×10^-18
f₁	ASPSO	3.458×10^-16	5.975×10^-15	2.858×10^-26
f₂	PSO	8.469×10^-17	7.486×10^-22	5.753×10^-8
f₂	MSPSO	2.368×10^-17	9.325×10^-20	6.225×10^-12
f₂	ASPSO	4.457×10^-18	6.733×10^-24	2.946×10^-22
f₃	PSO	6.445×10^-12	8.554×10^-8	6.332
f₃	MSPSO	5.398×10^-13	4.443×10^-11	9.352×10^-4
f₃	ASPSO	2.331×10^-14	9.328×10^-12	2.977×10^-10
f₄	PSO	4.474×10^-16	5.254×10^-12	3.289
f₄	MSPSO	9.998×10^-18	8.887×10^-16	5.877×10^-0
f₄	ASPSO	3.649×10^-22	3.977×10^-18	4.618×10^-8
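
The table above compares PSO, MSPSO and ASPSO by summary statistics of the values they reach on each benchmark function. As a rough illustration of how such statistics can be produced, the sketch below runs a textbook global-best PSO on the Sphere function over several seeded runs. It is not the paper's ASPSO or MSPSO, whose update rules are not given on this page. The swarm size, iteration count, dimension, inertia weight `w` and acceleration coefficients `c1`, `c2` are assumed values, not the paper's experimental settings.

```python
import random
import statistics


def sphere(x):
    """Sphere benchmark: sum of squared coordinates, minimum 0 at the origin."""
    return sum(xi ** 2 for xi in x)


def pso(f, dim, lo, hi, n_particles=30, n_iters=300,
        w=0.729, c1=1.49445, c2=1.49445, seed=0):
    """Textbook global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term plus pulls toward personal and global bests.
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                # Keep the particle inside the search range.
                pos[i][k] = min(hi, max(lo, pos[i][k] + vel[i][k]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Ten independent runs with different seeds, then the summary statistics.
results = [pso(sphere, dim=5, lo=-10.0, hi=10.0, seed=s)[1] for s in range(10)]
summary = {
    "median": statistics.median(results),
    "mean": statistics.mean(results),
    "std": statistics.stdev(results),
}
```

Fixing the seed per run keeps the experiment repeatable, so differences between algorithms are not confused with differences between random draws.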

Algorithm	Online Retail	N_BaIoT	Health News	Bag words
PDC-SFASPSO	0.895 (± 0.027)	0.886 (± 0.0372)	0.836 (± 0.0563)	0.721 (± 0.0775)
SP-DAP	0.766 (± 0.0372)	0.721 (± 0.0197)	0.691 (± 0.0754)	0.591 (± 0.0674)
SP-GAKMS	0.751 (± 0.0196)	0.711 (± 0.0542)	0.621 (± 0.0621)	0.521 (± 0.0788)
SP-LAICA	0.769 (± 0.0126)	0.703 (± 0.0278)	0.646 (± 0.0487)	0.546 (± 0.0597)
