通信学报 ›› 2021, Vol. 42 ›› Issue (5): 122-136.doi: 10.11959/j.issn.1000-436x.2021052

• 学术论文 • 上一篇    

基于信息熵与遗传算法的并行关联规则增量挖掘算法

毛伊敏1, 邓千虎1, 陈志刚2   

  1. 1 江西理工大学信息工程学院,江西 赣州 341000
    2 中南大学计算机学院,湖南 长沙 410083
  • 修回日期:2021-02-04 出版日期:2021-05-01 发布日期:2021-05-01
  • 作者简介:毛伊敏(1970- ),女,新疆伊犁人,博士,江西理工大学教授、硕士生导师,主要研究方向为数据挖掘、大数据安全与隐私保护
    邓千虎(1998- ),男,湖北天门人,江西理工大学硕士生,主要研究方向为数据挖掘、大数据
    陈志刚(1964- ),男,湖南益阳人,博士,中南大学教授、博士生导师,主要研究方向为网络与分布式计算、机会网络
  • 基金资助:
    国家自然科学基金资助项目(41562019);国家自然科学基金资助项目(61762046);国家重点研发计划基金资助项目(2018YFC1504705)

Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm

Yimin MAO1, Qianhu DENG1, Zhigang CHEN2   

  1. 1 School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
    2 College of Computer Science and Engineering, Central South University, Changsha 410083, China
  • Revised:2021-02-04 Online:2021-05-01 Published:2021-05-01
  • Supported by:
    The National Natural Science Foundation of China(41562019);The National Natural Science Foundation of China(61762046);The National Key Research and Development Program of China(2018YFC1504705)

摘要:

针对大数据环境下基于Can树的增量关联规则算法存在树结构空间占用过大、支持度阈值无法动态设置以及Map与Reduce阶段数据传输耗时等问题,提出了一种基于信息熵和遗传算法的并行关联规则增量挖掘算法MR-PARIMIEG。首先,该算法设计基于信息熵的相似项合并策略(SIM-IE)来合并相似数据项,并根据合并后的数据集进行Can树构造,从而减少树结构的空间占用;其次,提出基于遗传算法的DST-GA策略获取大数据环境下相对最优的动态支持度阈值,根据此阈值进行频繁项集挖掘,避免了冗余的频繁模式挖掘导致的时间消耗;最后,在MapReduce并行化运算过程中使用并行LZO数据压缩算法对Map端输出数据进行压缩,从而减少传输的数据规模,最终提升算法的运行速度。实验仿真结果表明,MR-PARIMIEG在大数据环境下进行频繁项集挖掘时具有较好的性能表现,适用于对较大规模的数据集进行并行化处理。

关键词: Can树, 信息熵, 大数据, 增量挖掘, 数据压缩

Abstract:

Aiming at the problems that in the big data environment, the Can-tree based incremental association rule algorithm had problems such as too much space occupation of the tree structure, inability to dynamically set the support threshold, and too much time consumption during the data transfer process between the Map and Reduce stages, the Map Reduce-based parallel association rules incremental mining algorithm using information entropy and genetic algorithm (MR-PARIMIEG)was proposed.Firstly, a similar items merging based on information entropy (SIM-IE) was designed to merge similar data items, and a Can tree based on the merged data set was constructed, thereby reducing the space occupation of the tree structure.Secondly, the dynamic support threshold obtaining using genetic algorithm (DST-GA) was proposed to obtain the relatively optimal dynamic support threshold in the big data environment, and frequent itemset mining was performed according to this threshold to avoid the unnecessary time consumption caused by mining redundant frequent patterns.Finally, in the process of MapReduce parallel operation, the parallel LZO data compression algorithm was used to compress the output data of the Map stage, thereby reducing the size of the transmitted data, and finally improving the running speed of the algorithm.Experimental simulation results show that MR-PARIMIEG has better performance when mining frequent item sets in the big data environment, and it is suitable for parallel processing of larger data sets.

Key words: Can-tree, information entropy, big data, incremental mining, data compression

中图分类号: 

No Suggested Reading articles found!