Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm

doi:10.11959/j.issn.1000-436x.2021052

Journal on Communications, 2021, Vol. 42, Issue (5): 122-136.

Yimin MAO¹, Qianhu DENG¹, Zhigang CHEN²

¹ School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
² College of Computer Science and Engineering, Central South University, Changsha 410083, China

Revised: 2021-02-04 Online: 2021-05-25 Published: 2021-05-01
About the authors:
Yimin MAO (1970- ), female, born in Yili, Xinjiang, Ph.D., professor and master's supervisor at Jiangxi University of Science and Technology. Her main research interests are data mining and big data security and privacy protection.
Qianhu DENG (1998- ), male, born in Tianmen, Hubei, master's student at Jiangxi University of Science and Technology. His main research interests are data mining and big data.
Zhigang CHEN (1964- ), male, born in Yiyang, Hunan, Ph.D., professor and doctoral supervisor at Central South University. His main research interests are network and distributed computing and opportunistic networks.
Supported by:
The National Natural Science Foundation of China (41562019); The National Natural Science Foundation of China (61762046); The National Key Research and Development Program of China (2018YFC1504705)

Abstract:

To address the problems of Can-tree based incremental association rule algorithms in the big data environment, namely excessive space occupied by the tree structure, the inability to set the support threshold dynamically, and time-consuming data transfer between the Map and Reduce stages, a MapReduce-based parallel association rules incremental mining algorithm using information entropy and genetic algorithm (MR-PARIMIEG) was proposed. Firstly, a similar items merging strategy based on information entropy (SIM-IE) was designed to merge similar data items, and the Can-tree was constructed from the merged data set, thereby reducing the space occupied by the tree structure. Secondly, a dynamic support threshold obtaining strategy using genetic algorithm (DST-GA) was proposed to obtain a relatively optimal dynamic support threshold in the big data environment, and frequent itemsets were mined according to this threshold, avoiding the time spent on mining redundant frequent patterns. Finally, during MapReduce parallel operation, the parallel LZO data compression algorithm was used to compress the output of the Map stage, reducing the volume of transmitted data and ultimately improving the running speed of the algorithm. Experimental simulation results show that MR-PARIMIEG performs well when mining frequent itemsets in the big data environment and is suitable for parallel processing of relatively large data sets.

Key words: Can-tree, information entropy, big data, incremental mining, data compression
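
To make the information entropy term in the abstract concrete, the sketch below computes the Shannon entropy of each item's presence or absence across a set of transactions. The abstract does not give the SIM-IE formula, so this is only an illustration under assumptions: the function names, the toy transactions, and the choice of binary (presence/absence) entropy are hypothetical and are not taken from the paper.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a yes/no variable that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def item_entropies(transactions):
    """Map each item to the entropy of its presence/absence over all transactions."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    return {
        item: binary_entropy(sum(item in t for t in transactions) / n)
        for item in items
    }

# Toy data (hypothetical): "a" occurs in every transaction, "b" in half, "c" in a quarter.
toy = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a"}]
entropies = item_entropies(toy)
```

In this toy data an item that appears in every transaction has zero entropy, while an item that appears in half of them has the maximum of one bit. How the paper turns entropy values into a merge decision for similar items is not stated in the abstract and is left open here.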

CLC number: TP311

Yimin MAO, Qianhu DENG, Zhigang CHEN. Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm[J]. Journal on Communications, 2021, 42(5): 122-136.

Figures/Tables (11): Table 1, Figures 1-3, Table 2, Figures 4-9


Database	Transaction	Items
	t₁	{a, b, d, g, e, c}
DB₁	t₂	{d, f, b, e, a}
	t₃	{a}
	t₄	{d, b, a}
DB₂	t₅	{a, c, b}
	t₆	{a, c, b, e}
DB₃	t₇	{a, c, b}
	t₈	{a, b, d, e, f}
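
The eight transactions listed above can be used to sketch how a Can-tree grows incrementally. This is a minimal sketch of the basic Can-tree idea only (items are arranged in a fixed canonical order, so new transactions are inserted without restructuring the tree); it does not include the paper's SIM-IE merging step. The class and function names are hypothetical, and lexicographic order is assumed as the canonical order.

```python
class CanNode:
    """One node of the prefix tree: an item, its count, and its child nodes."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def insert(root, transaction):
    """Insert one transaction, with items taken in canonical (lexicographic) order."""
    node = root
    for item in sorted(transaction):
        node = node.children.setdefault(item, CanNode(item))
        node.count += 1

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# Transactions t1..t8 copied from the table above.
transactions = [
    {"a", "b", "d", "g", "e", "c"},
    {"d", "f", "b", "e", "a"},
    {"a"},
    {"d", "b", "a"},
    {"a", "c", "b"},
    {"a", "c", "b", "e"},
    {"a", "c", "b"},
    {"a", "b", "d", "e", "f"},
]

root = CanNode()
for t in transactions:  # each insertion only adds nodes or increments counts
    insert(root, t)
```

Because the item order does not depend on item frequencies, a later batch of transactions can be added with the same `insert` call and the paths already in the tree stay valid.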

Dataset	Records	Items	Size/MB
RetailRocket	2 756 101	5	299.6
Accident	340 183	63	35.5
Susy	5 000 000	28	880.5
Jester	4 100 000	13	96
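
Since a support threshold is usually stated relative to the number of records, the record counts above determine the absolute count an itemset must reach to be frequent. The sketch below shows that arithmetic. The 1% threshold is a hypothetical value chosen only for illustration (the page does not state the thresholds used in the experiments), and the function name is an assumption.

```python
import math
from fractions import Fraction

# Record counts copied from the dataset table above.
RECORDS = {
    "RetailRocket": 2_756_101,
    "Accident": 340_183,
    "Susy": 5_000_000,
    "Jester": 4_100_000,
}

def min_support_count(records, relative_threshold):
    """Smallest absolute count meeting a relative threshold given as a decimal string."""
    # Fraction keeps the arithmetic exact, so ceil never trips on float rounding.
    return math.ceil(Fraction(relative_threshold) * records)

# "0.01" (1%) is a hypothetical threshold, not a value from the paper.
counts = {name: min_support_count(n, "0.01") for name, n in RECORDS.items()}
```

The same relative threshold translates into very different absolute counts across the four datasets, which is one reason a single fixed threshold can be hard to choose by hand.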
