Telecommunications Science, 2013, Vol. 29, Issue (8): 146-150. doi: 10.3969/j.issn.1000-0801.2013.08.025

• Operational Innovation Forum •

A Distributed Data-Crawling Technology for Microblog API

Shunhua Chen1, Xiaotong Wang1, Zhifeng Hao1, Ruichu Cai1, Xiaojun Xiao2, Yu Lu2

  1 School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  2 Guangzhou Useease Information Technology Co., Ltd., Guangzhou 510630, China
  • Publication date: 2013-08-15  Release date: 2017-06-21

Abstract:

With the rapid growth in microblog users, more and more people want to mine interesting patterns from user behavior and microblog content. Collecting microblog data efficiently and sensibly from the service provider is one of the main challenges. To address it, a distributed crawling technique based on the microblog API is proposed: it simulates microblog login to obtain authorization automatically, keeps the API call frequency within reasonable limits, and uses a task-assignment controller to acquire microblog data efficiently. The technique also combines time triggers with an in-memory database to control duplication, which avoids crawling and storing the same data twice and improves system performance. In the distributed framework, crawling tasks are assigned to distributed clients independently. The technique offers high scalability, clear task allocation, high efficiency, and multiple crawling strategies suited to different crawling needs. A Sina microblog crawling example verifies the feasibility of the technique.

Key words: Sina microblog, crawling strategy, distributed crawl, microblog API
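
The abstract names three mechanisms without giving their algorithms: API call-frequency control, duplicate control backed by an in-memory store, and task assignment to distributed clients. The following is a minimal Python sketch of one way each could look, not the authors' implementation. The sliding-window limiter, the time-to-live cache (one possible reading of "time trigger"), the round-robin split, and all names and numbers (`RateLimiter`, `SeenCache`, `assign_tasks`, the limits in the usage note) are illustrative assumptions; a plain dictionary stands in for the in-memory database.

```python
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most max_calls per window (seconds).

    Assumption: the paper does not state its throttling algorithm.
    """

    def __init__(self, max_calls, window):
        self.max_calls = max_calls
        self.window = window
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self, now):
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False  # caller should wait or switch to another token


class SeenCache:
    """In-memory record of crawled IDs; an entry expires after ttl seconds.

    Assumption: a dict stands in for the in-memory database.
    """

    def __init__(self, ttl):
        self.ttl = ttl
        self.seen = {}  # item id -> time it was last crawled

    def should_crawl(self, item_id, now):
        last = self.seen.get(item_id)
        if last is not None and now - last < self.ttl:
            return False  # crawled recently: skip to avoid duplication
        self.seen[item_id] = now
        return True


def assign_tasks(user_ids, clients):
    """Round-robin split of user IDs across client names (hypothetical policy)."""
    plan = {client: [] for client in clients}
    for i, uid in enumerate(user_ids):
        plan[clients[i % len(clients)]].append(uid)
    return plan
```

A controller might call `assign_tasks(ids, ["client-a", "client-b"])` once per batch, while each client checks `SeenCache.should_crawl` before fetching and `RateLimiter.try_acquire` before every API request. Timestamps are passed in explicitly so the behavior is easy to test; in practice they would come from a clock such as `time.monotonic()`.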
