Journal on Communications ›› 2022, Vol. 43 ›› Issue (3): 148-163. doi: 10.11959/j.issn.1000-436x.2022054

• Academic paper •

基于Spark框架和ASPSO的并行划分聚类算法

毛伊敏1, 甘德瑾1, 廖列法1, 陈志刚2   

  1. 1 江西理工大学信息工程学院,江西 赣州 341000
    2 中南大学计算机学院,湖南 长沙 410083
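  • Revised: 2021-12-10 Online: 2022-03-25 Published: 2022-03-01
  • About the authors: Yimin MAO (1970- ), female, born in Yili, Xinjiang, Ph.D., professor and doctoral supervisor at Jiangxi University of Science and Technology. Her main research interests are data mining and big data security and privacy protection.
    Dejin GAN (1997- ), male, born in Fuzhou, Jiangxi, master's student at Jiangxi University of Science and Technology. His main research interests are data mining and big data.
    Liefa LIAO (1975- ), male, born in Yushan, Jiangxi, Ph.D., professor and master's supervisor at Jiangxi University of Science and Technology. His main research interests include artificial intelligence.
    Zhigang CHEN (1964- ), male, born in Yiyang, Hunan, Ph.D., professor and doctoral supervisor at Central South University. His main research interests are network and distributed computing and opportunistic networks.
  • Supported by:
    The National Natural Science Foundation of China (41562019); Technological Innovation 2030-"Next-Generation Artificial Intelligence" Major Projects (2020AAA0109605)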

Parallel division clustering algorithm based on Spark framework and ASPSO

Yimin MAO1, Dejin GAN1, Liefa LIAO1, Zhigang CHEN2   

  1. 1 School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
    2 College of Computer Science and Engineering, Central South University, Changsha 410083, China
  • Revised: 2021-12-10 Online: 2022-03-25 Published: 2022-03-01
  • Supported by:
    The National Natural Science Foundation of China (41562019); Technological Innovation 2030-Next-Generation Artificial Intelligence Major Projects (2020AAA0109605)

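Abstract:

To address the problems that partition clustering algorithms face when processing massive data, namely a large data dispersion coefficient and poor anti-interference, difficulty in determining the number of local clusters, randomness of local cluster centroids, and low efficiency of parallel merging of local clusters, a parallel partition clustering algorithm based on the Spark framework and an adaptive strategy for particle swarm optimization (ASPSO), named PDC-SFASPSO, was proposed. Firstly, a grid division strategy based on the Pearson correlation coefficient and variance was proposed to obtain grid cells with a small data dispersion coefficient and to filter outliers, which solved the problems of the large data dispersion coefficient and poor anti-interference. Secondly, a grid division strategy based on the potential function and the Gaussian function was proposed to obtain the number of clusters for local clustering, which solved the problem that the number of local clusters was difficult to determine. Thirdly, ASPSO was proposed to obtain the local cluster centroids, which solved the problem of randomness in the local cluster centroids. Finally, a merging strategy based on cluster radius and neighbor nodes was proposed to merge highly similar clusters in parallel, which improved the efficiency of parallel merging of local clusters. Experimental results showed that the PDC-SFASPSO algorithm performed well in partition clustering of data in a big data environment and was suitable for parallel clustering of large-scale datasets.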

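Key words: Spark framework, parallel partition clustering, grid division, adaptive strategy for particle swarm optimization (ASPSO), parallel merging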

Abstract:

To address the problems that partition clustering algorithms encounter when processing massive data, such as a large data dispersion coefficient and poor anti-interference, difficulty in determining the number of local clusters, randomness of local cluster centroids, and low efficiency of parallel merging of local clusters, a parallel partition clustering algorithm based on the Spark framework and ASPSO (PDC-SFASPSO) was proposed. Firstly, a meshing strategy was introduced to reduce the data dispersion coefficient of the data division and improve anti-interference. Secondly, to determine the number of clusters, a meshing strategy based on the potential function and the Gaussian function was proposed, which formed regions with different sample points as cluster cores and yielded the number of local clusters. Then, to avoid randomness in the local cluster centroids, ASPSO was proposed. Finally, a local cluster merging strategy based on cluster radius and neighbor nodes was introduced to merge highly similar clusters on the Spark parallel computing framework, which improved the efficiency of parallel merging of local clusters. Experimental results showed that the PDC-SFASPSO algorithm performed well in data partitioning and clustering in a big data environment and was suitable for parallel clustering of large-scale datasets.
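The abstract does not give the exact similarity test used in the merging step, so the following is only a minimal single-machine sketch of the general idea of radius-based merging. It assumes, purely for illustration, that two local clusters are "similar" when the distance between their centroids is smaller than the sum of their radii. The function names (`centroid_of`, `cluster_radius`, `merge_overlapping`) are hypothetical, and the neighbor-node criterion and the Spark parallelization described in the paper are not modeled here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid_of(points):
    """Arithmetic mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cluster_radius(points, centroid):
    """Largest distance from the centroid to any member point."""
    return max(dist(p, centroid) for p in points)


def merge_overlapping(clusters):
    """Greedily merge clusters whose centroids lie closer than the sum of radii.

    ASSUMPTION: this overlap rule is an illustrative stand-in for the
    similarity measure in the paper, which the abstract does not specify.
    """
    clusters = [list(c) for c in clusters]  # copy so inputs are not mutated
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid_of(clusters[i])
                cj = centroid_of(clusters[j])
                ri = cluster_radius(clusters[i], ci)
                rj = cluster_radius(clusters[j], cj)
                if dist(ci, cj) < ri + rj:
                    # Absorb cluster j into cluster i, then rescan from the start.
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

In this sketch the pairwise scan is quadratic in the number of local clusters; a distributed version would presumably restrict comparisons to neighboring partitions, which is where a neighbor-node criterion could reduce the work.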

Key words: Spark framework, parallel division clustering, grid division, ASPSO, parallel merge

CLC number: 
