Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm

doi:10.11959/j.issn.1000-436x.2021052

Abstract

To address the problems of the Can-tree-based incremental association rule algorithm in the big data environment, namely excessive space occupied by the tree structure, the inability to set the support threshold dynamically, and excessive time spent transferring data between the Map and Reduce stages, a MapReduce-based parallel association rules incremental mining algorithm using information entropy and genetic algorithm (MR-PARIMIEG) was proposed. Firstly, a similar items merging strategy based on information entropy (SIM-IE) was designed to merge similar data items, and a Can-tree was constructed from the merged data set, thereby reducing the space occupied by the tree structure. Secondly, a dynamic support threshold obtaining strategy using a genetic algorithm (DST-GA) was proposed to obtain a relatively optimal dynamic support threshold in the big data environment, and frequent itemsets were mined according to this threshold to avoid the unnecessary time consumed by mining redundant frequent patterns. Finally, during MapReduce parallel operation, the parallel LZO data compression algorithm was used to compress the output of the Map stage, thereby reducing the size of the transmitted data and improving the running speed of the algorithm. Experimental simulation results show that MR-PARIMIEG performs better when mining frequent itemsets in the big data environment and is suitable for parallel processing of larger data sets.
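The role of the support threshold can be illustrated with a minimal Python sketch. It is not the DST-GA strategy, whose details the abstract does not give. It only shows, by brute-force enumeration, how the number of frequent itemsets grows as the threshold is lowered, which is the cost a well-chosen threshold avoids. The function name `frequent_itemsets` and the absolute-count form of the threshold are assumptions made for illustration. The transactions are the eight rows of the sample database listed on this page.

```python
from itertools import combinations

# The eight sample transactions t1..t8 from the example database.
TRANSACTIONS = [
    {"a", "b", "d", "g", "e", "c"},
    {"d", "f", "b", "e", "a"},
    {"a"},
    {"d", "b", "a"},
    {"a", "c", "b"},
    {"a", "c", "b", "e"},
    {"a", "c", "b"},
    {"a", "b", "d", "e", "f"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset whose count
    is at least min_support (hypothetical helper, brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    size = 1
    while size <= len(items):
        found = False
        for candidate in combinations(items, size):
            # Support count = number of transactions containing the candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                result[candidate] = count
                found = True
        # If no itemset of this size is frequent, no larger one can be.
        if not found:
            break
        size += 1
    return result


if __name__ == "__main__":
    for threshold in (7, 4, 2):
        found = frequent_itemsets(TRANSACTIONS, threshold)
        print(threshold, len(found))
```

Enumerating every candidate is exponential in the number of items and is only practical for a toy example like this one. Tree-based miners such as those built on a Can-tree avoid this enumeration.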

Key words: Can-tree, information entropy, big data, incremental mining, data compression

CLC Number: TP311

Yimin MAO, Qianhu DENG, Zhigang CHEN. Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm[J]. Journal on Communications, 2021, 42(5): 122-136.

Database	Transaction	Items
	t₁	{a, b, d, g, e, c}
DB₁	t₂	{d, f, b, e, a}
	t₃	{a}
	t₄	{d, b, a}
DB₂	t₅	{a, c, b}
	t₆	{a, c, b, e}
DB₃	t₇	{a, c, b}
	t₈	{a, b, d, e, f}
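The transactions above can be used to sketch how a Can-tree absorbs incremental data. The following is a minimal Python illustration, not the authors' implementation. It assumes alphabetical order as the canonical item order, and the names `CanNode`, `insert`, and `count_nodes` are hypothetical. Because the item order is fixed in advance and does not depend on item frequency, new transactions are appended as paths without restructuring the tree.

```python
class CanNode:
    """One node of a prefix tree whose paths follow a fixed canonical order."""

    def __init__(self, item=None):
        self.item = item      # None for the root
        self.count = 0        # number of transactions passing through this node
        self.children = {}    # item -> CanNode


def insert(root, transaction):
    """Insert one transaction as a path; shared prefixes share nodes."""
    node = root
    # Assumption: alphabetical order serves as the canonical order.
    for item in sorted(transaction):
        node = node.children.setdefault(item, CanNode(item))
        node.count += 1


def count_nodes(node):
    """Count all nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Transactions t1..t8 from the table, inserted one at a time in arrival order.
TRANSACTIONS = [
    {"a", "b", "d", "g", "e", "c"},
    {"d", "f", "b", "e", "a"},
    {"a"},
    {"d", "b", "a"},
    {"a", "c", "b"},
    {"a", "c", "b", "e"},
    {"a", "c", "b"},
    {"a", "b", "d", "e", "f"},
]

root = CanNode()
for t in TRANSACTIONS:
    insert(root, t)

total_items = sum(len(t) for t in TRANSACTIONS)
print(count_nodes(root), "nodes for", total_items, "item occurrences")
```

Inserting the same transactions in any other arrival order yields the same tree, which is what makes the structure suitable for incremental updates. The node count relative to the number of item occurrences gives a rough sense of the space that prefix sharing saves.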

Dataset	Number of records	Number of items	Size/MB
RetailRocket	2 756 101	5	299.6
Accident	340 183	63	35.5
Susy	5 000 000	28	880.5
Jester	4 100 000	13	96
