Big Data Research


Research on multi-teacher BERT model distillation for natural language understanding

Shi Jialai, Guo Weibin

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

Multi-teacher distillation BERT model in NLU tasks

SHI Jialai, GUO Weibin    

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

Abstract: Knowledge distillation is a model compression approach commonly used to address the large size and slow inference of deep pre-trained models such as BERT. Multi-teacher distillation can further improve the performance of the student model, but the traditional "one-to-one" strategy that forcibly assigns teacher intermediate layers to student layers discards most of the intermediate features. A "single-layer-to-multi-layer" mapping is proposed to solve the problem that intermediate layers cannot be aligned during knowledge distillation, helping the student model acquire the syntactic, coreference and other knowledge contained in the teachers' intermediate layers. Experiments on several GLUE datasets show that the student model retains 93.9% of the teachers' average inference accuracy while occupying only 41.5% of their average parameter size.

Keywords:

deep pre-trained models, BERT, multi-teacher distillation, natural language understanding

Abstract: Knowledge distillation is a model compression scheme commonly used to solve the problems of large scale and slow inference of deep pre-trained models such as BERT. The "multi-teacher distillation" method can further improve the performance of the student model, while the traditional "one-to-one" mapping strategy, which forcibly assigns a single teacher intermediate layer to each student layer, causes most of the intermediate features to be abandoned. A "one-to-many" layer mapping is proposed to solve the problem that the intermediate layers cannot be aligned during knowledge distillation, and to help the student model master the grammar, coreference and other knowledge in the intermediate layers of the teacher models. Experiments on several datasets in GLUE show that the student model retains 93.9% of the average inference accuracy of the teacher models, while only accounting for 41.5% of their average parameter size.

Key words:

deep pre-trained models, BERT, multi-teacher distillation, natural language understanding
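To make the idea in the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a "one-to-many" intermediate-layer distillation loss combined with averaged multi-teacher soft targets. It assumes a 12-layer teacher and a 4-layer student whose hidden states already share the same width (in practice a learned projection would usually be needed); the names one_to_many_map, intermediate_loss and multi_teacher_logit_loss are illustrative only.

import torch
import torch.nn.functional as F

def one_to_many_map(num_student_layers, num_teacher_layers):
    # Assign each student layer a contiguous block of teacher layers,
    # so that no teacher intermediate layer is discarded.
    block = num_teacher_layers // num_student_layers
    return {s: list(range(s * block, (s + 1) * block))
            for s in range(num_student_layers)}

def intermediate_loss(student_hidden, teacher_hidden, layer_map):
    # MSE between each student hidden state and the mean of the teacher
    # hidden states mapped to it ("one-to-many" alignment).
    loss = 0.0
    for s_idx, t_idxs in layer_map.items():
        target = torch.stack([teacher_hidden[t] for t in t_idxs]).mean(dim=0)
        loss = loss + F.mse_loss(student_hidden[s_idx], target)
    return loss / len(layer_map)

def multi_teacher_logit_loss(student_logits, teacher_logits_list, T=2.0):
    # KL divergence between the student and the averaged soft targets of
    # several teachers (the usual response-based multi-teacher term).
    avg_teacher = torch.stack(teacher_logits_list).mean(dim=0)
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(avg_teacher / T, dim=-1),
                    reduction="batchmean") * (T * T)

if __name__ == "__main__":
    batch, seq_len, hidden, num_classes = 2, 8, 768, 3
    teacher_hidden = [torch.randn(batch, seq_len, hidden) for _ in range(12)]
    student_hidden = [torch.randn(batch, seq_len, hidden) for _ in range(4)]
    layer_map = one_to_many_map(4, 12)   # {0: [0, 1, 2], 1: [3, 4, 5], ...}
    print(intermediate_loss(student_hidden, teacher_hidden, layer_map))
    teacher_logits = [torch.randn(batch, num_classes) for _ in range(2)]
    student_logits = torch.randn(batch, num_classes)
    print(multi_teacher_logit_loss(student_logits, teacher_logits))

In this sketch each of the 4 student layers imitates the average of 3 consecutive teacher layers, so all 12 teacher layers contribute to training, in contrast to a one-to-one assignment that would use only 4 of them.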
