Telecommunications Science ›› 2017, Vol. 33 ›› Issue (8): 180-186. doi: 10.11959/j.issn.1000-0801.2017234

• Operation Technology Wide Angle •

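Research and design of a distributed high-performance web crawler based on a cloud platform

Enming SHI, Xiaojun XIAO, Yu LU

  1. Guangzhou Useease Information Technology Co., Ltd., Guangzhou 510630, China
  • Revised: 2017-07-27 Online: 2017-08-01 Published: 2017-08-25
  • About the authors: Enming SHI (1991-), male, works at Guangzhou Useease Information Technology Co., Ltd.; his main research interests include data mining, artificial intelligence, and geographic information systems. | Xiaojun XIAO (1970-), male, Ph.D., is a senior engineer at Guangzhou Useease Information Technology Co., Ltd.; his main research interests include big data, data mining, and telecom industry applications. | Yu LU (1983-), male, works at Guangzhou Useease Information Technology Co., Ltd.; his main research interests include big data, machine learning, and artificial intelligence.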

Abstract:

With the arrival of the big data era, data has become the most valuable resource, and web crawler technology, as an important means of collecting external data, has become a standard tool for data analysis. A high-performance, flexible, and convenient cloud-based crawler architecture and its implementation are introduced. The crawler is described in detail in terms of its overall architecture, its distributed design, and the design of each module. Each crawler module is packaged with Docker, and Kubernetes handles resource scheduling and management for the cluster. For performance optimization, a combination of strategies is adopted, including an MD5 deduplication tree algorithm, DNS optimization, and asynchronous I/O. Experiments show that the crawler has a fairly clear performance advantage over the unoptimized scheme.

Key words: distributed system architecture, web crawler, Docker, high-performance
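
The abstract names an MD5 deduplication tree but does not describe its structure. As a rough illustration only, the sketch below assumes one plausible reading: each URL is hashed with MD5 and the 32 hex digits of the digest are stored as a path in a trie, so a URL counts as already seen when its full path exists. The class name `Md5DedupTree` and its `add` method are hypothetical and are not taken from the paper.

```python
import hashlib


class Md5DedupTree:
    """Trie keyed on the hex digits of each URL's MD5 digest.

    Illustrative assumption only; the paper's actual data structure
    is not described in the abstract.
    """

    def __init__(self):
        self.root = {}

    def add(self, url: str) -> bool:
        """Insert url; return True if it was new, False if already seen."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        node = self.root
        is_new = False
        for ch in digest:
            if ch not in node:
                # A missing branch means this digest was never stored.
                node[ch] = {}
                is_new = True
            node = node[ch]
        return is_new


seen = Md5DedupTree()
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
# Keep only URLs that have not been seen before.
to_fetch = [u for u in urls if seen.add(u)]
```

Because every digest has the same length, a lookup costs a fixed 32 steps regardless of how many URLs are stored, which is one reason a crawler might prefer such a structure to comparing raw URL strings.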
