Deep Web new data discovery strategy based on the graph model of data attribute value lists

doi: 10.11959/j.issn.1000-436x.2016049

Journal on Communications, 2016, Vol. 37, Issue (3): 20-32.

Zhi-ming CUI¹,², Peng-peng ZHAO², Xue-feng XIAN¹,²,³, Li-gang FANG¹,³, Yuan-feng YANG¹,³, Cai-dong GU¹,³

¹ Jiangsu Province Support Software Engineering R&D Center for Modern Information Technology Application in Enterprise, Suzhou 215104, China
² Institute of Intelligent Information Processing and Application, Soochow University, Suzhou 215006, China
³ School of Computer Engineering, Suzhou Vocational University, Suzhou 215104, China

Online: 2016-03-25 Published: 2017-08-04
Supported by:
The National Natural Science Foundation of China (3 projects); The Natural Science Foundation of Jiangsu Province; Suzhou Foundation for Development of Science and Technology (3 projects)

Abstract:

To address the incremental crawling of newly generated data records in a data source, a deep Web new data discovery strategy is proposed. The strategy represents a deep Web data source with a new graph model of attribute value lists and transforms the new data discovery problem into a traversal problem on that graph. The model depends only on the data. Compared with existing query-related graph models, it offers better adaptability and certainty, and it applies to deep Web data sources that provide only a simple query interface. Based on this model, incremental nodes are discovered and their ability to find new data is predicted. Mutual information is used to compute the dependencies between nodes, and query selection reduces the negative impact of query dependence as far as possible. The strategy improves the efficiency of crawling new data. Experimental results show that, under the same resource constraints, the strategy keeps local data maximally synchronized with remote data.

Key words: deep Web, new data discovery, data acquisition
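
The abstract gives no formulas, so the following is only a minimal sketch of the ideas it names, under stated assumptions. It assumes a plain co-occurrence graph over (attribute, value) pairs, pointwise mutual information estimated from the local copy of the records, and a hypothetical `growth` score per node standing in for the predicted new-data discovery ability. The function names, the sample records, and the scoring rule are illustrative and are not taken from the paper.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_value_graph(records):
    """Link every pair of (attribute, value) nodes that co-occur in a record."""
    graph = defaultdict(set)
    for record in records:
        for a, b in combinations(record.items(), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pointwise_mutual_information(records, a, b):
    """PMI (in nats) of two nodes, estimated from a non-empty record list."""
    n = len(records)
    has_a = [r.get(a[0]) == a[1] for r in records]
    has_b = [r.get(b[0]) == b[1] for r in records]
    p_a = sum(has_a) / n
    p_b = sum(has_b) / n
    p_ab = sum(x and y for x, y in zip(has_a, has_b)) / n
    if p_ab == 0:
        return float("-inf")  # the two values never appear together
    return math.log(p_ab / (p_a * p_b))


def pick_next_query(records, candidates, issued, growth):
    """Greedy choice: growth score minus the strongest positive dependence
    on any query that was already issued (assumed scoring rule)."""
    def score(node):
        penalty = max(
            [0.0] + [pointwise_mutual_information(records, node, q) for q in issued]
        )
        return growth.get(node, 0.0) - penalty
    return max(candidates, key=score)


# Hypothetical local copy of a data source with two attributes.
records = [
    {"brand": "A", "color": "red"},
    {"brand": "A", "color": "red"},
    {"brand": "B", "color": "blue"},
    {"brand": "B", "color": "red"},
]
graph = build_value_graph(records)

# Hypothetical growth scores; ("brand", "A") has already been queried.
growth = {("color", "red"): 0.5, ("color", "blue"): 0.4}
best = pick_next_query(
    records,
    candidates=[("color", "red"), ("color", "blue")],
    issued=[("brand", "A")],
    growth=growth,
)
print(best)
```

In this toy data, `("color", "red")` overlaps heavily with the already issued `("brand", "A")`, so the penalty lowers its score and the less dependent `("color", "blue")` is chosen even though its growth score is smaller.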

Zhi-ming CUI, Peng-peng ZHAO, Xue-feng XIAN, Li-gang FANG, Yuan-feng YANG, Cai-dong GU. Deep Web new data discovery strategy based on the graph model of data attribute value lists[J]. Journal on Communications, 2016, 37(3): 20-32.



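Table 1

Observation point	Data source	NR_Coverage	NR_Cost
Time point t₇	Sipo	92.5%	35.2%
Time point t₇	Mingzh	91.5%	38.5%
Time point t₇	Okbuy	87.7%	39.3%
Time point t₇	Soxieke	89.5%	36.6%
Time point t₇	Letao	87.1%	38.6%
Time point t₁₀	Sipo	89.6%	32.5%
Time point t₁₀	Mingzh	90.2%	39.7%
Time point t₁₀	Okbuy	89.3%	35.8%
Time point t₁₀	Soxieke	91.%	38.7%
Time point t₁₀	Letao	88.9%	35.9%
Time point t₁₅	Sipo	90.6%	36.1%
Time point t₁₅	Mingzh	90.8%	39.3%
Time point t₁₅	Okbuy	88.5%	37.4%
Time point t₁₅	Soxieke	89.3%	38.9%
Time point t₁₅	Letao	87.5%	36.9%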
