Journal on Communications ›› 2022, Vol. 43 ›› Issue (3): 148-163. doi: 10.11959/j.issn.1000-436x.2022054

• Papers

Parallel division clustering algorithm based on Spark framework and ASPSO

Yimin MAO1, Dejin GAN1, Liefa LIAO1, Zhigang CHEN2   

  1 School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
  2 College of Computer Science and Engineering, Central South University, Changsha 410083, China
  • Revised: 2021-12-10  Online: 2022-03-25  Published: 2022-03-01
  • Supported by:
    The National Natural Science Foundation of China (41562019); Technological Innovation 2030-Next-Generation Artificial Intelligence Major Projects (2020AAA0109605)

Abstract:

Partition clustering algorithms that process massive data suffer from a large data dispersion coefficient and poor anti-interference ability in data division, difficulty in determining the number of local clusters, randomness of local cluster centroids, and low efficiency in parallelizing and merging local clusters. To address these problems, a parallel partition clustering algorithm based on the Spark framework and ASPSO (PDC-SFASPSO) was proposed. Firstly, a meshing strategy was introduced to reduce the data dispersion coefficient of the data division and improve anti-interference ability. Secondly, to determine the number of local clusters, a meshing strategy based on a potential function and a Gaussian function was proposed, which formed regions with different sample points as core clusters and thereby obtained the number of local clusters. Then, to avoid the randomness of local cluster centroids, ASPSO was proposed. Finally, a local cluster merging strategy based on cluster radius and neighbor nodes was introduced to merge highly similar clusters on the Spark parallel computing framework, which improved the efficiency of parallel merging of local clusters. Experimental results showed that the PDC-SFASPSO algorithm performed well in data partitioning and clustering in a big data environment and was suitable for parallel clustering of large-scale data sets.
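
For illustration only, the following minimal Python sketch shows the general idea behind two of the steps named in the abstract: dividing the data with a grid and merging neighboring local clusters by cluster radius. It is not the PDC-SFASPSO implementation. The helper names (grid_partition, summarize, merge_by_radius), the fixed cell size, and the merge criterion (centroid distance no larger than the sum of the two radii) are all assumptions made for this example. The sketch runs on a single machine and omits Spark, the potential and Gaussian functions, and ASPSO.

```python
from collections import defaultdict
from math import dist, floor


def grid_partition(points, cell_size):
    """Group points by the grid cell that contains them (hypothetical helper)."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(floor(x / cell_size) for x in p)
        cells[key].append(p)
    return dict(cells)


def summarize(cluster):
    """Return (centroid, radius); radius is the largest centroid-to-point distance."""
    n = len(cluster)
    centroid = tuple(sum(coord) / n for coord in zip(*cluster))
    radius = max(dist(centroid, p) for p in cluster)
    return centroid, radius


def merge_by_radius(clusters):
    """Greedily merge clusters whose centroids are no farther apart than the
    sum of their radii. This criterion is an assumption for this sketch."""
    clusters = [list(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, ri = summarize(clusters[i])
                cj, rj = summarize(clusters[j])
                if dist(ci, cj) <= ri + rj:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


# Toy data: one dense group that straddles a cell boundary, plus a distant pair.
points = [
    (0.5, 0.5), (0.9, 0.5), (0.95, 0.5),
    (1.0, 0.5), (1.1, 0.5), (1.5, 0.5),
    (5.0, 5.0), (5.2, 5.0),
]
cells = grid_partition(points, cell_size=1.0)
merged = merge_by_radius(list(cells.values()))
print(len(cells), [len(c) for c in merged])
```

In this toy run the grid splits the dense group across two cells, and the radius-based merge joins the two halves again while leaving the distant pair separate. In a Spark setting, the per-cell work would typically be distributed across partitions; that part is not shown here.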

Key words: Spark framework, parallel division clustering, grid division, ASPSO, parallel merge

