[1]
DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, June 2-7, 2019. ACL Press, 2019: 4171-4186.
[2]
YANG Z L, DAI Z L, CARBONELL J G, et al. XLNet: Generalized Autoregressive
Pretraining for Language Understanding[C]//Advances in Neural Information
Processing Systems 32: Annual Conference on Neural Information Processing
Systems, Canada, December 8-14, 2019. New York: NeurIPS, 2019: 5754-5764.
[3]
LIU Z, LIN W, SHI Y, et al. A robustly optimized BERT pre-training approach
with post-training[C]//Chinese Computational Linguistics: 20th China National
Conference, CCL 2021, Hohhot, China, August 13–15, 2021, Proceedings. Cham:
Springer International Publishing, 2021: 471-484.
[4]
XIE K, LU S, WANG M, et al. ELBERT: Fast ALBERT with confidence-window based early exit[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 7713-7717.
[5]
LAN Z Z, CHEN M D, GOODMAN S, et al. ALBERT: A Lite BERT for Self-supervised
Learning of Language Representations[C]//8th International Conference on
Learning Representations, Ethiopia, April 26-30, 2020. New York:
OpenReview.net, 2020: 564-571.
[6]
JIAO X Q, YIN Y C, SHANG L F, et al. TinyBERT: Distilling BERT for Natural Language Understanding[C]//Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, November 16-20, 2020. ACL Press, 2020: 4163-4174.
[7]
SUN S Q, CHENG Y, GAN Z, et al. Patient Knowledge Distillation for BERT Model Compression[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, November 3-7, 2019. New York: EMNLP-IJCNLP, 2019: 4322-4331.
[8]
ILICHEV A, SOROKIN N, PIONTKOVSKAYA I, et al. Multiple Teacher Distillation for Robust and Greener Models[C]//Proceedings of the International Conference on Recent Advances in Natural Language Processing, Held Online, September 1-3, 2021. New York: RANLP, 2021: 601-610.
[9]
WANG A, SINGH A, MICHAEL J, et al. GLUE: A multi-task benchmark and analysis
platform for natural language understanding[J]. arXiv preprint
arXiv:1804.07461, 2018.
[10]
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
[11]
任欢, 王旭光. 注意力机制综述[J]. 计算机应用, 2021, 41(z1): 1-6.
REN H, WANG X G. Overview of attention mechanism[J]. Computer Applications, 2021, 41(z1): 1-6.
[12]
李爱黎, 张子帅, 林荫, 等. 基于社交网络大数据的民众情感监测研究[J]. 大数据, 2022, 8(6): 105-126.
LI A L, ZHANG Z S, LIN Y, et al. Research on public emotion monitoring based on social network big data[J]. Big Data, 2022, 8(6): 105-126.
[13]
韩立帆, 季紫荆, 陈子睿, 等. 数字人文视域下面向历史古籍的信息抽取方法研究[J]. 大数据, 2022, 8(6): 26-39.
HAN L F, JI Z J, CHEN Z R, et al. Research on information extraction from historical ancient books from the perspective of digital humanities[J]. Big Data, 2022, 8(6): 26-39.
[14]
MICHEL P, LEVY O, NEUBIG G. Are sixteen heads really better than one?[J]. Advances in Neural Information Processing Systems, 2019, 32.
[15]
XU Y, WANG Y, ZHOU A, et al. Deep neural network compression with single and multiple level quantization[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
[16]
ZAFRIR O, BOUDOUKH G, IZSAK P, et al. Q8BERT: Quantized 8bit BERT[C]//2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS). IEEE, 2019: 36-39.
[17]
HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
[18]
AL-OMARI H, ABDULLAH M A, SHAIKH S. EmoDet2: Emotion detection in English textual dialogue using BERT and BiLSTM models[C]//2020 11th International Conference on Information and Communication Systems (ICICS). IEEE, 2020: 226-232.
[19]
杨秋勇, 彭泽武, 苏华权, 等. 基于Bi-LSTM-CRF的中文电力实体识别[J]. 信息技术, 2021(9): 45-50.
YANG Q Y, PENG Z W, SU H Q, et al. Chinese power entity recognition based on Bi-LSTM-CRF[J]. Information Technology, 2021(9): 45-50.
[20]
叶榕, 邵剑飞, 张小为, 等. 基于BERT-CNN的新闻文本分类的知识蒸馏方法研究[J]. 电子技术应用, 2023, 49(1): 8-13.
YE R, SHAO J F, ZHANG X W, et al. Research on knowledge distillation method of news text classification based on BERT-CNN[J]. Application of Electronic Technology, 2023, 49(1): 8-13.
[21]
XU C, ZHOU W, GE T, et al. BERT-of-Theseus: Compressing BERT by progressive module replacing[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020: 7859-7869.
[22]
张睿东. 基于BERT和知识蒸馏的自然语言理解研究[D]. 南京大学, 2020.
ZHANG R D. Research on natural language understanding based on BERT and knowledge distillation[D]. Nanjing University, 2020.
[23]
FUKUDA T, KURATA G. Generalized Knowledge Distillation from an Ensemble of
Specialized Teachers Leveraging Unsupervised Neural Clustering[C]//ICASSP
2021-2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2021: 6868-6872.
[24]
CHO J H, HARIHARAN B. On the efficacy of knowledge distillation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 4794-4802.
[25]
JIANG L, WEN Z, LIANG Z, et al. Long short-term sample distillation[C]//Proceedings of the AAAI Conference on Artificial Intelligence, New York, USA, 2020: 4345-4352.
[26]
YANG Z, SHOU L, GONG M, et al. Model compression with two-stage multi-teacher
knowledge distillation for web question answering system[C]//Proceedings of the
13th International Conference on Web Search and Data Mining. 2020: 690-698.
[27]
WU C, WU F Z, HUANG Y F. One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers[C]//Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. New York: ACL Press, 2021: 4408-4413.
[28]
YUAN F, SHOU L, PEI J, et al. Reinforced multi-teacher selection for knowledge
distillation[C]//Proceedings of the AAAI Conference on Artificial Intelligence.
2021, 35(16): 14284-14291.
[29]
CLARK K, LUONG M T, LE Q V, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators[C]//8th International Conference on Learning Representations, Addis Ababa, April 26-30, 2020. New York: ICLR, 2020.