Journal on Communications (通信学报) ›› 2022, Vol. 43 ›› Issue (4): 154-163. doi: 10.11959/j.issn.1000-436x.2022080

• Academic paper

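Research on a real-time receiving scheme of streaming data

Xiaoyan ZHANG (张笑燕), Zhihao LIU (刘志浩), Xiaofeng DU (杜晓峰), Tianbo LU (陆天波)

  1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Revised: 2022-04-05 Online: 2022-04-25 Published: 2022-04-01
  • About the authors: Xiaoyan ZHANG (1973- ), female, born in Yantai, Shandong, Ph.D., professor at Beijing University of Posts and Telecommunications. Her main research interests are software engineering theory, mobile Internet software, and big data analysis.
    Zhihao LIU (1996- ), male, born in Linyi, Shandong, master's student at Beijing University of Posts and Telecommunications. His main research interests are big data analysis, and mobile and Internet software.
    Xiaofeng DU (1973- ), male, born in Hancheng, Shaanxi, lecturer at Beijing University of Posts and Telecommunications. His main research interests are cloud computing and big data analysis.
    Tianbo LU (1977- ), male, born in Bijie, Guizhou, Ph.D., professor at Beijing University of Posts and Telecommunications. His main research interests are network and information security, secure software engineering, and P2P computing.
  • Supported by:
    The National Natural Science Foundation of China (62162060)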

Abstract:

This paper examines a scenario common in modern data warehouse systems: a large volume of streaming data must be received, joined with data already stored on disk, and then loaded into the warehouse. By setting the disk paging appropriately and applying a cache module to spread the disk I/O pressure, a more efficient data receiving scheme is proposed that builds on existing research. A consistent hash function is then introduced to extend the scheme to distributed environments, yielding the D-CACHEJOIN algorithm. The cost model of the algorithm is derived theoretically, and simulation experiments are carried out on data that follow a Zipfian distribution. The experimental results show that, in scenarios close to real-world applications, the proposed algorithm is more efficient than existing algorithms and can be extended to distributed environments quickly and easily.

Key words: streaming data, cache, distributed system, consistent hash function
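
The abstract names three ingredients: a cache in front of the disk-resident data, a consistent hash function that spreads join keys across nodes, and Zipfian test data. The Python sketch below shows how these pieces could fit together. It is not the paper's D-CACHEJOIN implementation. The class and function names (`ConsistentHashRing`, `CachedJoinNode`, `zipf_keys`, `run_demo`), the LRU eviction policy, the virtual-node count, and the dictionary standing in for the disk relation are all assumptions made for illustration.

```python
import bisect
import hashlib
import random
from collections import OrderedDict


def stable_hash(value: str) -> int:
    """Map a string to a 64-bit integer (SHA-256 is used only for its even spread)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hash ring with virtual nodes (hypothetical helper, not the paper's code)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, stable_hash(str(key))) % len(self._ring)
        return self._ring[idx][1]


class CachedJoinNode:
    """One worker: an LRU cache (assumed policy) in front of a simulated disk relation."""

    def __init__(self, disk, cache_size):
        self.disk = disk            # dict standing in for the on-disk relation
        self.cache = OrderedDict()  # join key -> disk row, kept in LRU order
        self.cache_size = cache_size
        self.hits = 0
        self.disk_reads = 0

    def join(self, key, payload):
        row = self.cache.get(key)
        if row is not None:
            self.cache.move_to_end(key)          # mark as most recently used
            self.hits += 1
        else:
            row = self.disk[key]                 # simulated disk I/O
            self.disk_reads += 1
            self.cache[key] = row
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)   # evict the least recently used key
        return (key, payload, row)


def zipf_keys(n_keys, n_samples, exponent=1.0, seed=0):
    """Draw keys 0..n_keys-1 with probability proportional to 1 / rank**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_samples)


def run_demo(n_keys=10_000, n_samples=50_000, cache_size=200):
    """Route a skewed stream to three workers and return them for inspection."""
    names = ["node-a", "node-b", "node-c"]
    disk = {k: f"row-{k}" for k in range(n_keys)}
    ring = ConsistentHashRing(names)
    workers = {name: CachedJoinNode(disk, cache_size) for name in names}
    for seq, key in enumerate(zipf_keys(n_keys, n_samples)):
        workers[ring.node_for(key)].join(key, seq)
    return workers


if __name__ == "__main__":
    for name, worker in run_demo().items():
        total = worker.hits + worker.disk_reads
        print(f"{name}: {total} tuples, hit ratio {worker.hits / total:.2f}")
```

Because every stream tuple with a given join key is routed to the same worker, each worker's cache only has to hold the hot keys of its own partition. Adding a node to the ring moves only the keys that now fall on the new node's ring points, so the caches of the existing workers stay largely valid.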
