Information extraction from massive Web pages based on node property and text content

doi:10.11959/j.issn.1000-436x.2016190

Abstract

To address the problem of extracting valuable information from massive Web pages in big data environments, a novel information extraction method based on node properties and text content was proposed. Web pages were converted into document object model (DOM) trees, and a pruning and fusion algorithm was introduced to simplify each DOM tree. For each node in the DOM tree, a density property and a vision property were defined, and Web pages were preprocessed based on these property values. A MapReduce framework was employed to extract information from massive Web pages in parallel. Simulation and experimental results demonstrate that the proposed method achieves better extraction performance and higher scalability than the compared methods.
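
As a rough illustration of the first steps named in the abstract, the sketch below parses a page into a simple DOM-like tree with Python's standard `html.parser` and computes a text-density value for each node. The `Node` and `TreeBuilder` classes, the density definition (characters of text per tag in a subtree), and the rule of pruning `script`/`style` subtrees are assumptions made for illustration only; the abstract does not state the paper's own definitions of the density and vision properties, and the sketch assumes well-formed HTML.

```python
from html.parser import HTMLParser

# Assumed pruning rule: subtrees of these tags carry no visible text.
PRUNED_TAGS = {"script", "style", "noscript"}
# Void elements have no end tag, so they are never pushed onto the tree.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this node

    def text_length(self):
        return len(self.text) + sum(c.text_length() for c in self.children)

    def tag_count(self):
        return 1 + sum(c.tag_count() for c in self.children)

    def text_density(self):
        # Hypothetical density: characters of text per tag in the subtree.
        return self.text_length() / self.tag_count()


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a pruned subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in PRUNED_TAGS:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = (
    "<html><body>"
    "<div id='nav'><a href='#'>Home</a><a href='#'>News</a></div>"
    "<div id='main'><p>Long body text of the article goes here.</p></div>"
    "<script>var x = 1;</script>"
    "</body></html>"
)
root = build_tree(page)
body = root.children[0].children[0]  # document -> html -> body
for child in body.children:
    print(child.tag, round(child.text_density(), 2))
```

In this toy page the link-heavy navigation block gets a much lower density than the article block, which is the kind of signal a density-based preprocessing step could use to separate body text from noise.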

Key words: Web information, extraction, MapReduce, DOM tree

Hai-yan WANG, Pan CAO. Information extraction from massive Web pages based on node property and text content[J]. Journal on Communications, 2016, 37(10): 9-17.

Figures/Tables 6

References 14

[1]	GRISHMAN R. Information extraction: techniques and challenges[EB/OL].
[2]	LI L, ZHOU Y Q, WANG J H. Comprehensive information based Chinese information extraction system and application[J]. Journal of Beijing University of Posts and Telecommunications, 2005, 28(6): 48-51.
[3]	HUANG S L, ZHENG X L, CHEN D R. A semi-supervised learning method for product named entity recognition[J]. Journal of Beijing University of Posts and Telecommunications, 2013, 36(2): 20-23.
[4]	QIN B, LIU A A, LIU T. Unsupervised Chinese open entity relation extraction[J]. Journal of Computer Research and Development, 2015, 52(5): 1029-1035.
[5]	LI T Y, LIU L, ZHAO D W, et al. Eliciting relations from requirements text based on dependency analysis[J]. Journal of Computers, 2013, 31(1): 54-62.
[6]	DENG C, YU S P, WEN J R. VIPS: a vision-based page segmentation[R]// Microsoft Technical Report, MSR-TR-2003-79, 2003.
[7]	NEIL A, HONG J. Visually extracting data records from the deep Web[C]// WWW 2013. Rio, IEEE Press, 2013: 1233-1238.
[8]	NARWAL N. Improving Web data extraction by noise removal[C]// ARTCom 2013. Bangalore, IET, 2013: 388-395.
[9]	SUN F, SONG D, LIAO L. DOM based content extraction via text density[C]// ACM SIGIR 2011. Beijing, 2011: 245-254.
[10]	ZHANG N Z, CAO W, LI S J. A method based on node density segmentation and label propagation for mining Web page[J]. Journal of Computers, 2015, 38(2): 349-364.
[11]	WANG J B, WANG L Z, GAO W L, et al. Chinese Web content extraction based on naive Bayes model[C]// International Federation for Information Processing IFIP. 2014: 404-413.
[12]	KRISHNA S S, DATTATRAYA J S. Schema inference and data extraction from templatized Web pages[C]// ICPC, 2015: 1-6.
[13]	BHUIYAN M A, ALHASAN M. FSM-H: frequent subgraph mining algorithm in Hadoop[C]// Big Data. 2014: 9-16.
[14]	JIN S Y, BOULWARE D, KIMMEY D. A parallel spatial co-location mining algorithm based on MapReduce[C]// Big Data. 2014: 25-31.

Number of pages	Avg. nodes per page (before processing)	Avg. nodes per page (after processing)	Fusion rate
30	1 653	603	63.5%
60	1 820	720	60.4%
90	1 673	659	60.6%
120	1 683	647	61.6%
150	1 745	635	63.6%
180	1 670	603	63.9%
210	1 735	620	64.3%
240	1 698	597	64.8%
Average fusion rate over all rows: 62.8%
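
The fusion-rate column in the table above appears to be the fraction of DOM nodes removed by pruning and fusion, that is, (nodes before minus nodes after) divided by nodes before. This definition is inferred from the tabulated numbers and is not stated in the text, so the sketch below should be read as a consistency check under that assumption; the function name `fusion_rate` is hypothetical.

```python
def fusion_rate(nodes_before, nodes_after):
    """Fraction of DOM nodes removed (assumed definition, inferred from the table)."""
    return (nodes_before - nodes_after) / nodes_before


# (pages, average nodes before, average nodes after), copied from the table above
rows = [
    (30, 1653, 603), (60, 1820, 720), (90, 1673, 659), (120, 1683, 647),
    (150, 1745, 635), (180, 1670, 603), (210, 1735, 620), (240, 1698, 597),
]

rates = [fusion_rate(before, after) for _, before, after in rows]
for (pages, _, _), rate in zip(rows, rates):
    print(f"{pages:>3} pages: {rate:.1%}")
print(f"mean fusion rate: {sum(rates) / len(rates):.1%}")
```

Under this assumed definition, the recomputed per-row rates and their mean agree with the tabulated values to one decimal place.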

Number of pages	Proposed method P	Proposed method R	Proposed method F₁	Ref. [10] P	Ref. [10] R	Ref. [10] F₁	Ref. [11] P	Ref. [11] R	Ref. [11] F₁
30	0.89	0.94	0.91	0.83	0.94	0.88	0.78	0.96	0.86
60	0.92	0.95	0.93	0.84	0.96	0.90	0.77	0.96	0.85
90	0.90	0.96	0.92	0.84	0.96	0.89	0.80	0.95	0.87
120	0.91	0.96	0.93	0.82	0.95	0.88	0.76	0.97	0.85
150	0.90	0.95	0.92	0.80	0.95	0.87	0.75	0.95	0.84
180	0.93	0.96	0.95	0.81	0.96	0.88	0.75	0.96	0.84
210	0.92	0.95	0.93	0.81	0.95	0.87	0.76	0.95	0.84
240	0.91	0.96	0.93	0.81	0.95	0.87	0.76	0.96	0.85
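
Assuming P, R, and F₁ in the table above are the usual precision, recall, and their harmonic mean (the page does not define them), F₁ can be recomputed from the other two columns. The helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) of the proposed method at 30 and 240 pages, copied from the table above
print(round(f1_score(0.89, 0.94), 2))
print(round(f1_score(0.91, 0.96), 2))
```

Because the tabulated P and R are themselves rounded to two decimals, a recomputed F₁ can differ from the tabulated one by 0.01 in some rows.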

Number of nodes	Total time (s)	Average time (s)	Speedup
1	46 875	4.69	1
2	23 721	2.36	1.99
4	12 460	1.20	3.96
6	8 343	0.83	5.85
8	6 154	0.62	7.78
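
One way to read the speedup column above is through parallel efficiency, the speedup divided by the number of nodes, where 1.0 would mean ideal linear scaling. Efficiency is not reported in the source; it is a derived quantity added here for illustration, computed only from the tabulated speedup values.

```python
# (number of nodes, speedup), copied from the table above
speedups = [(1, 1.0), (2, 1.99), (4, 3.96), (6, 5.85), (8, 7.78)]


def parallel_efficiency(speedup, nodes):
    """Speedup divided by node count; 1.0 would be ideal linear scaling."""
    return speedup / nodes


for nodes, speedup in speedups:
    print(f"{nodes} nodes: efficiency {parallel_efficiency(speedup, nodes):.3f}")
```

With the tabulated speedups, efficiency stays above 0.97 up to 8 nodes, which is consistent with the scalability claim in the abstract.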

Number of nodes	Number of pages	Total time (s)	Average time (s)
8	100	65	0.65
8	500	318	0.64
8	2 000	1 265	0.63
8	5 000	3 157	0.63
8	8 000	5 044	0.63
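
The average-time column in the table above matches the total time divided by the number of pages, so it can be read as seconds per page. The short check below makes that relationship explicit; the function name `seconds_per_page` is hypothetical.

```python
# (pages, total time in seconds) for the 8-node runs, copied from the table above
runs = [(100, 65), (500, 318), (2000, 1265), (5000, 3157), (8000, 5044)]


def seconds_per_page(total_seconds, pages):
    """Average processing time per page."""
    return total_seconds / pages


for pages, total in runs:
    print(f"{pages:>5} pages: {seconds_per_page(total, pages):.2f} s/page")
```

The per-page time stays nearly constant as the page count grows from 100 to 8 000, meaning total time grows roughly linearly with input size on the 8-node configuration.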
