Study and optimization on system architectures of Larbin

Chinese Journal of Network and Information Security, 2016, Vol. 2, Issue (8): 74-83. doi: 10.11959/j.issn.2096-109x.2016.00076

• Academic paper •

Xuan WANG¹,², Yi-xia HUO¹, Yun-fei CI¹, Guo-zhen SHI¹, Li LI¹,²

¹ School of Information Security, Beijing Electronic Science and Technology Institute, Beijing 100070, China
² School of Computer, Xidian University, Xi'an 710000, China

Revised: 2016-08-02  Online: 2016-08-01  Published: 2017-06-04
About the authors:
Xuan WANG (1991-), female, born in Heze, Shandong, is a master's student at Xidian University. Her main research interest is multi-core scheduling.
Yi-xia HUO (1991-), female, born in Langfang, Hebei, is a master's student at Beijing Electronic Science and Technology Institute. Her main research interest is network security.
Yun-fei CI (1989-), male, born in Chizhou, Anhui, is a master's student at Beijing Electronic Science and Technology Institute. His main research interests are access control and information security.
Guo-zhen SHI (1974-), male, born in Jiyuan, Henan, holds a Ph.D. and is an associate professor and master's supervisor at Beijing Electronic Science and Technology Institute. His main research interests are network and system security and embedded security.
Li LI (1974-), female, born in Qingdao, Shandong, is a Ph.D. candidate at Xidian University and an associate professor and master's supervisor at Beijing Electronic Science and Technology Institute. Her main research interests are network and system security and embedded system security applications.
Supported by:
The National Key Research Program of China (2016YFB0800304); The Natural Science Foundation of Beijing (4152048); The Natural Science Foundation of Jiangsu Province (BK20150787); 2016 Spring Buds Project of Beijing Electronic Science & Technology Institute (2016CL04)

Abstract

Web crawlers are an important part of a search engine, and their performance directly affects the search engine's accuracy and timeliness. Larbin is an efficient, simple open-source crawler framework with relatively complete functionality. Several typical open-source crawler frameworks are first introduced and compared along multiple dimensions. The system architecture and working mechanism of Larbin are then described in detail. Its shortcomings in program structure and workflow are pointed out, and corresponding optimizations are proposed. Experimental results show that the improved scheme is better in both speed and performance.

Key words: search engine, Web crawler, Larbin, open source, optimization

CLC number: TP393

Xuan WANG, Yi-xia HUO, Yun-fei CI, Guo-zhen SHI, Li LI. Study and optimization on system architectures of Larbin[J]. Chinese Journal of Network and Information Security, 2016, 2(8): 74-83.

Figures and tables (8): Table 1, Figures 1-4, Table 2, Figures 5-6

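Table 1
Open-source framework	Secondary development	Language	Distributed support	Mirror saving	Advantages
Larbin	Easy	C++	No	Yes	Efficient, highly customizable
Nutch	Hard	Java	Yes	No	Supports distribution, provides crawling and indexing, provides plugin extensions
Heritrix	Easy	Java	No	Yes	Highly extensible, highly controllable
Scrapy	Moderate	Python	Yes	No	High usability, full-featured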

Table 2
Category	Tools
Python	Beautiful Soup, html5lib
Java	Jsoup, htmlparser, JTidy
C/C++	Gumbo, Htmlcxx, Streaming HTML parser, libhtml
