Fast Single Phase Algorithm for Utility Mining in Big Data

doi:10.11959/j.issn.1000-0801.2015100

Abstract

Most recent utility mining algorithms generate a huge number of candidate patterns when dealing with big data, which creates a scalability bottleneck. A few algorithms do not generate candidates, but they are computationally expensive and lack effective pruning, so their running time becomes the bottleneck. A novel algorithm is proposed that finds high utility patterns in a single phase without generating candidates. Its novelty lies in three points: first, pattern enumeration by prefix growth combined with a pruning strategy based on estimating an upper bound on utility; second, a representation of utility information based on a sparse matrix with pseudo projection; third, a depth-first search that saves storage space because high utility patterns are not materialized in memory. Extensive experiments on synthetic and real-world data show that the new algorithm is more than 5 times faster than existing algorithms, uses 20% to 60% less memory, and scales well.

Keywords: big data, utility mining, high utility pattern, frequent pattern

Junqiang Liu, Qingfeng Zhou, Wenhui Wang, Lei Shi. Fast Single Phase Algorithm for Utility Mining in Big Data[J]. Telecommunications Science (电信科学), 2015, 31(4): 77-85.

Figures and tables (9): Table 1, Table 2, Figures 1 to 7


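Transaction table (rows are transactions, columns are items a to g; "-" marks an item absent from the transaction):

tid	a	b	c	d	e	f	g
t₁	1	-	1	-	1	-	-
t₂	6	2	2	-	-	5	-
t₃	1	1	1	2	6	-	5
t₄	3	1	-	4	3	-	-
t₅	2	1	-	2	-	2	-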
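
The ideas named in the abstract (prefix growth, pruning by a utility upper bound, depth-first search without storing patterns) can be illustrated on the transaction table above. The following Python sketch is not the authors' algorithm. It treats the table cells as item quantities and assumes a unit profit of 1 for every item, because this page does not give a profit table. A dictionary per transaction stands in for the sparse matrix, row indexes stand in for pseudo projection, and the names `TRANSACTIONS`, `UNIT_PROFIT`, `ORDER`, and `mine` are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# Cell values copied from the transaction table; reading them as quantities is an assumption.
TRANSACTIONS: Dict[str, Dict[str, int]] = {
    "t1": {"a": 1, "c": 1, "e": 1},
    "t2": {"a": 6, "b": 2, "c": 2, "f": 5},
    "t3": {"a": 1, "b": 1, "c": 1, "d": 2, "e": 6, "g": 5},
    "t4": {"a": 3, "b": 1, "d": 4, "e": 3},
    "t5": {"a": 2, "b": 1, "d": 2, "f": 2},
}

# ASSUMPTION: every item has unit profit 1 (the page gives no profit table).
UNIT_PROFIT: Dict[str, int] = {item: 1 for item in "abcdefg"}

# Fixed item order used to enumerate each pattern exactly once.
ORDER: List[str] = sorted(UNIT_PROFIT)


def mine(min_util: int) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (pattern, utility) for every pattern with utility >= min_util."""
    # Sparse rows: only items present in a transaction are stored.
    rows = [
        {item: qty * UNIT_PROFIT[item] for item, qty in t.items()}
        for t in TRANSACTIONS.values()
    ]

    def grow(prefix, support, start):
        # support: list of (row index, utility of prefix in that row).
        # Rows are referenced by index, never copied (a stand-in for pseudo projection).
        for pos in range(start, len(ORDER)):
            item = ORDER[pos]
            projected = [(r, u + rows[r][item]) for r, u in support if item in rows[r]]
            if not projected:
                continue
            pattern = prefix + (item,)
            utility = sum(u for _, u in projected)
            if utility >= min_util:
                yield pattern, utility  # emitted immediately, not stored
            # Upper bound for all extensions: utility so far plus the utility
            # of items that come later in ORDER within the supporting rows.
            remaining = sum(
                rows[r][j] for r, _ in projected for j in ORDER[pos + 1:] if j in rows[r]
            )
            if utility + remaining >= min_util:
                yield from grow(pattern, projected, pos + 1)

    yield from grow((), [(r, 0) for r in range(len(rows))], 0)


if __name__ == "__main__":
    for pattern, utility in mine(min_util=20):
        print("".join(pattern), utility)
```

The bound is safe because utilities are non-negative and any extension of a pattern is supported by a subset of the pattern's rows, so a branch whose bound falls below `min_util` cannot contain a qualifying pattern. Writing `mine` as a generator mirrors the abstract's point that results need not be held in memory.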
