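Application of improved self-training model in the identification of users with poor service quality

doi:10.11959/j.issn.1000-0801.2021191

Telecommunications Science, 2021, Vol. 37, Issue (10): 136-142.

Li YU¹, Zhe LI¹, Fei GAO¹, Xiangyang YUAN¹, Yong YANG²

¹ China Mobile Research Institute, Beijing 100053, China
² China Mobile Communications Corporation, Beijing 100033, China

Revised: 2021-06-20 Online: 2021-10-20 Published: 2021-10-01

About the authors:
Li YU (1981- ), male, is a deputy general manager of the Artificial Intelligence and Intelligent Operation Center at China Mobile Research Institute and a senior engineer. His main research interests include frontier mobile communication technologies, network intelligence, big data, and IT technologies.
Zhe LI (1992- ), male, is a researcher at China Mobile Research Institute. His main research interests include the 5G core network, artificial intelligence, telecom big data, and deep packet inspection.
Fei GAO (1978- ), male, is a researcher at China Mobile Research Institute. His main research interests include artificial intelligence, network big data analysis, and data governance.
Xiangyang YUAN (1978- ), male, is a deputy general manager of the Artificial Intelligence and Intelligent Operation Center at China Mobile Research Institute. His main research interests include artificial intelligence, network intelligence, big data, and IT technologies.
Yong YANG (1972- ), male, is the manager of the Service Assurance Office in the Network Department of China Mobile Communications Corporation. His main research interests include wireless network quality and the planning and analysis of service indicators.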

Abstract

Identifying poor-quality users is an important step in reducing the complaint rate and improving user satisfaction. In current telecommunications network systems, the large volume of structured and unstructured data related to service perception is difficult to label effectively, poor-quality user labels are incomplete, and the training samples available to existing supervised learning models are imbalanced. Together these problems lead to a low recognition rate for poor-quality users. To address them, an improved self-training semi-supervised learning model is adopted: a small number of users with low satisfaction scores or complaints serve as poor-quality user labels for annotating network data, and label transfer is then used to train on a large amount of unlabeled data and identify poor-quality users. Experiments show that, compared with fully supervised learning, which is accurate but costly to train, and unsupervised learning, which has low recognition accuracy, semi-supervised learning can make full use of unlabeled sample data for effective training, significantly improving the recognition accuracy for poor-quality users while keeping the training cost low.

Key words: semi-supervised learning, improved self-training model, poor-quality user identification, unlabeled data
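The self-training idea summarized in the abstract can be sketched as follows. This is a minimal illustration of plain self-training with a confidence threshold, not the authors' improved variant, whose details are not given on this page. The nearest-centroid classifier, the function names, the `threshold` value and the round limit are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def centroids(X, y):
    """Mean feature vector per class (0 = normal user, 1 = poor-quality user).

    Assumes both classes are present in y.
    """
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_proba(model, x):
    """Confidence that x belongs to class 1, from squared distances to the centroids."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, model[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, model[1]))
    z = max(-50.0, min(50.0, d1 - d0))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(z))


def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Repeatedly pseudo-label confident samples and retrain on the enlarged set.

    Returns the final model and the unlabeled fraction at the start of each round.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    total = len(X_lab) + len(pool)
    history = []
    for _ in range(max_rounds):
        model = centroids(X_lab, y_lab)
        history.append(len(pool) / total)
        remaining = []
        for x in pool:
            p = predict_proba(model, x)
            if p >= threshold:            # confident poor-quality user
                X_lab.append(x)
                y_lab.append(1)
            elif p <= 1.0 - threshold:    # confident normal user
                X_lab.append(x)
                y_lab.append(0)
            else:                         # too uncertain, keep for a later round
                remaining.append(x)
        if len(remaining) == len(pool):   # nothing new was labeled, stop early
            break
        pool = remaining
    return centroids(X_lab, y_lab), history


# Synthetic demo data (hypothetical, not the paper's dataset): two 2-D clusters,
# with only 10% of the samples labeled.
_rng = random.Random(0)
_normal = [[_rng.gauss(0.0, 0.5), _rng.gauss(0.0, 0.5)] for _ in range(50)]
_poor = [[_rng.gauss(3.0, 0.5), _rng.gauss(3.0, 0.5)] for _ in range(50)]
_model, _history = self_train(_normal[:5] + _poor[:5], [0] * 5 + [1] * 5,
                              _normal[5:] + _poor[5:])
print("unlabeled fraction per round:", _history)
```

The threshold controls the trade-off noted for self-training in general: a lower value labels more samples per round but lets more wrong pseudo-labels into the training set, where their errors can accumulate.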

CLC number: TN915.41

Cite this article: Li YU, Zhe LI, Fei GAO, Xiangyang YUAN, Yong YANG. Application of improved self-training model in the identification of users with poor service quality[J]. Telecommunications Science, 2021, 37(10): 136-142.

Figures and tables (7): Table 1, Figure 1, Table 2, Table 3, Figure 2, Figure 3, Table 4

References (14)

[1] MA O B, LIU X J, TANG X D, et al. Malicious URL detection based on semi-supervised learning[J]. Computer Systems & Applications, 2020, 29(11): 11-20.
[2] OUYANG Y, YANG A D, MENG F Y. A game theory-assisted machine learning methodology for subscriber churn behaviors detection[J]. Telecommunications Science, 2020, 36(6): 79-89.
[3] ZHANG J L, CHANG Y L, SHI W. Overview on label propagation algorithm and applications[J]. Application Research of Computers, 2013, 30(1): 21-25.
[4] PENG J, GONG X F, LI J. Semi-supervised feature extraction method based on LPA-SKFST[J]. Application Research of Computers, 2021, 38(6): 1657-1661.
[5] ZHOU Z H. Disagreement-based semi-supervised learning[J]. Acta Automatica Sinica, 2013, 39(11): 1871-1878.
[6] ZHOU Z H, LI M. Tri-training: exploiting unlabeled data using three classifiers[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(11): 1529-1541.
[7] HUANG H J, QIAN L, WANG Y J. A SVM-based technique to detect phishing URLs[J]. Information Technology Journal, 2012, 11(7): 921-925.
[8] NIGAM K, GHANI R. Analyzing the effectiveness and applicability of co-training[C]// Proceedings of the Ninth International Conference on Information and Knowledge Management. [S.l.:s.n.], 2000: 86-93.
[9] ZHOU Z H. Disagreement-based semi-supervised learning[J]. Acta Automatica Sinica, 2013, 39(11): 1871.
[10] MCCLOSKY D, CHARNIAK E, JOHNSON M. Effective self-training for parsing[C]// Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Piscataway: IEEE Press, 2006.
[11] ROSENBERG C, HEBERT M, SCHNEIDERMAN H. Semi-supervised self-training of object detection models[C]// Proceedings of 2005 Seventh IEEE Workshops on Applications of Computer Vision. Piscataway: IEEE Press, 2005: 29-36.
[12] LI J N. Research on semi-supervised self-training method[D]. Chongqing: Chongqing Normal University, 2018.
[13] DONG L Y, SUI P, SUN P, et al. Novel naive Bayes classification algorithm based on semi-supervised learning[J]. Journal of Jilin University (Engineering and Technology Edition), 2016, 46(3): 884-889.
[14] CHEN Y, YANG X, SUN D H. Research on college students' scientific research ability prediction based on GA-XGBoost model[J]. Mathematics in Practice and Theory, 2021, 51(6): 318-328.

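Table 1
Method	References	Advantages	Disadvantages
Graph-based semi-supervised learning	[3-4]	Widely applicable, highly extensible	Must be retrained when new samples are added
Disagreement-based semi-supervised learning	[5-6]	Multiple classifiers cooperate to improve accuracy	Places high demands on the performance of each classifier
Semi-supervised support vector machine	[7]	Requires fewer data samples	Poor interpretability, limited accuracy
Co-training	[8-9]	Improves model prediction accuracy	High data preprocessing requirements
Self-training	[10-12]	Fast to run, relatively high accuracy	Accumulated errors can degrade performance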

Table 2
Data type	Records	Attributes (columns)	Labels (columns)
Poor-quality users	3,000	15	1
Non-poor-quality users	3,000	15	1

Table 3
Model	Accuracy	AUC	F1
Fully supervised model	0.9706	0.9706	0.9699
Semi-supervised model	0.9394	0.9389	0.9365
Unsupervised model	0.7405	-	-
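The three metrics reported in the table above can be computed as in the sketch below. The function names are hypothetical and the page does not say how the authors computed AUC, so this is only one reasonable reading. The sketch also shows that when AUC is computed from hard 0/1 predictions on a class-balanced test set, it coincides with accuracy, which may explain why the fully supervised row shows the same value in both columns.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = poor-quality user."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Share of all samples that were classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Balanced toy example (hypothetical data): 4 positives and 4 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, y_pred))
```

In this toy example one positive and one negative are misclassified, so accuracy, F1 and the hard-label AUC all come out at 0.75. Passing continuous scores to `auc` instead of 0/1 predictions generally gives a different value.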

Table 4
Round	Missing-value ratio	Accuracy	AUC	F1
1	0.9000	0.9184	0.8846	0.8781
2	0.3679	0.9755	0.8168	0.8254
3	0.2679	0.9766	0.8107	0.8193
4	0.1576	0.9768	0.8125	0.8188
5	0.0919	0.9767	0.8017	0.8105
6	0.0669	0.9689	0.8030	0.8128
7	0.0569	0.9775	0.7968	0.8065
8	0.0305	0.9781	0.7991	0.8090
9	0.0267	0.9743	0.7928	0.8054
10	0.0260	0.9782	0.7968	0.8061
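The per-round figures above can be turned into the share of samples that gain a label in each round. The sketch below assumes that the missing-value ratio is the fraction of all samples still unlabeled at that round; the page does not define the column, so this reading is an assumption, as are the function names and the 0.01 tolerance.

```python
# Missing-value ratio per round, copied from the table above (rounds 1 to 10).
UNLABELED_RATIO = [0.9000, 0.3679, 0.2679, 0.1576, 0.0919,
                   0.0669, 0.0569, 0.0305, 0.0267, 0.0260]


def gains_per_round(ratios, digits=4):
    """Share of all samples that left the unlabeled pool in each round after the first."""
    return [round(prev - cur, digits) for prev, cur in zip(ratios, ratios[1:])]


def first_round_below(gains, tol):
    """Number of the first round whose gain is below tol (gains start at round 2)."""
    for round_no, gain in enumerate(gains, start=2):
        if gain < tol:
            return round_no
    return None


gains = gains_per_round(UNLABELED_RATIO)
for round_no, gain in enumerate(gains, start=2):
    print(f"round {round_no}: {gain:.4f} of all samples newly labeled")
print("first round with gain below 0.01:", first_round_below(gains, 0.01))
```

Under this reading, more than half of all samples (0.5321) receive a label between rounds 1 and 2, and the gain drops below one percent of the data from round 9 onward, which would be a natural point for a stopping rule.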
