Telecommunications Science ›› 2016, Vol. 32 ›› Issue (12): 7-12. doi: 10.11959/j.issn.1000-0801.2016317

• Special topic: Big data

Application of random forest in big data completion

Zheng WANG, Hua REN, Yanping FANG

  1. Shanghai Research Institute of China Telecom Co., Ltd., Shanghai 200122, China
  • Online: 2016-12-20 Published: 2017-04-26

Abstract:

Telecom operators hold large volumes of data, but for a variety of reasons its quality is often unsatisfactory: many records are incomplete or missing altogether. Mining existing data is only meaningful once the data meets quality requirements and reaches a sufficient sampling ratio. Building on the existing nationwide log retention system, a template library of complete data was designed and used to identify data that fails the quality requirements. A random forest algorithm then finds the best-matching identical or related data, which is used to complete the deficient records and raise data quality. A backtracking feedback method optimizes and extends the template library. A data completion subsystem built into the nationwide log retention system provides end-to-end data quality assurance and improvement, completing and improving historical and even real-time data, so that the data ultimately meets the requirements of processing and mining, and the quality and value of operator data increase.

Key words: big data, random forest, machine learning, data completion
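
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of completing a missing field with a random forest trained on complete records. It assumes scikit-learn and NumPy are available. The record layout (`hour_of_day`, `traffic_mb`, `service_type`), the toy values, and the function name `complete_records` are hypothetical and do not come from the paper. The complete rows stand in for the template library, and the sketch assumes the feature columns themselves have no gaps.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical log records: columns are [hour_of_day, traffic_mb, service_type].
# service_type is the field to be completed; np.nan marks a missing value.
records = np.array([
    [9.0, 120.0, 0.0],
    [10.0, 135.0, 0.0],
    [11.0, 110.0, 0.0],
    [21.0, 900.0, 1.0],
    [22.0, 950.0, 1.0],
    [23.0, 870.0, 1.0],
    [10.0, 125.0, np.nan],  # incomplete record
    [22.0, 910.0, np.nan],  # incomplete record
])


def complete_records(data, target_col, n_trees=50, seed=0):
    """Fill missing values in target_col with a forest trained on complete rows.

    Assumes all other columns are fully populated (an assumption of this sketch).
    Returns a completed copy and a boolean mask of the rows that were filled.
    """
    data = data.copy()
    missing = np.isnan(data[:, target_col])
    feature_cols = [c for c in range(data.shape[1]) if c != target_col]

    # Complete rows play the role of the template library.
    template_x = data[~missing][:, feature_cols]
    template_y = data[~missing][:, target_col].astype(int)

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(template_x, template_y)

    # Predict the missing field for each incomplete row and write it back.
    if missing.any():
        data[missing, target_col] = forest.predict(data[missing][:, feature_cols])
    return data, missing


completed, was_missing = complete_records(records, target_col=2)
print(completed[was_missing])
```

The feedback step that optimizes and extends the template library is not modeled here. A numeric target field would call for `RandomForestRegressor` in place of the classifier.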
