[1] TANG J, LIU Y, LIU Z Y, et al. Cognition and sustainable learning: technical prospects of pre-training models[J]. Communications of the CCF, 2021, 17(5): 1-8.

[2] ZENG W, REN X Z, SU T, et al. PanGu-α: large-scale autoregressive pretrained Chinese language models with auto-parallel computation[J]. arXiv preprint, 2021, arXiv:2104.12369.

[3] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.

[4] LAMPLE G, CONNEAU A. Cross-lingual language model pretraining[J]. Advances in Neural Information Processing Systems, 2019, 32: 7059-7069.

[5] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[J]. arXiv preprint, 2019, arXiv:1907.11692.

[6] XUE L T, CONSTANT N, ROBERTS A, et al. mT5: a massively multilingual pre-trained text-to-text transformer[C]// Proceedings of 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2021: 483-498.

[7] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[J]. arXiv preprint, 2020, arXiv:2005.14165.

[8] FEDUS W, ZOPH B, SHAZEER N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity[J]. arXiv preprint, 2021, arXiv:2101.03961.

[9] YANG A, LIN J Y, MEN R, et al. Exploring sparse expert models and beyond[J]. arXiv preprint, 2021, arXiv:2105.15082.

[10] PETERS M, NEUMANN M, IYYER M, et al. Deep contextualized word representations[C]// Proceedings of 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2018: 2227-2237.

[11] SONG K, TAN X, QIN T, et al. MASS: masked sequence to sequence pre-training for language generation[C]// Proceedings of the International Conference on Machine Learning. [S.l.:s.n.], 2019: 5926-5936.

[12] LIU Y H, GU J T, GOYAL N, et al. Multilingual denoising pre-training for neural machine translation[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 726-742.

[13] CHEN Z Y, MA N Z, LIU B. Lifelong learning for sentiment classification[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Stroudsburg: Association for Computational Linguistics, 2015: 750-756.

[14] MERMILLOD M, BUGAISKA A, BONIN P. The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects[J]. Frontiers in Psychology, 2013, 4: 504.

[15] MAI Z D, LI R W, JEONG J, et al. Online continual learning in image classification: an empirical survey[J]. Neurocomputing, 2022, 469: 28-51.

[16] DE LANGE M, ALJUNDI R, MASANA M, et al. A continual learning survey: defying forgetting in classification tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021: 1.

[17] REBUFFI S A, KOLESNIKOV A, SPERL G, et al. iCaRL: incremental classifier and representation learning[C]// Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 5533-5542.

[18] DE LANGE M, TUYTELAARS T. Continual prototype evolution: learning online from non-stationary data streams[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. [S.l.:s.n.], 2021: 8250-8259.

[19] ROBINS A. Catastrophic forgetting, rehearsal and pseudorehearsal[J]. Connection Science, 1995, 7(2): 123-146.

[20] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[J]. Advances in Neural Information Processing Systems, 2014, 27: 2672-2680.

[21] LOPEZ-PAZ D, RANZATO M A. Gradient episodic memory for continual learning[J]. Advances in Neural Information Processing Systems, 2017, 30: 6467-6476.

[22] CHAUDHRY A, RANZATO M, ROHRBACH M, et al. Efficient lifelong learning with A-GEM[C]// Proceedings of the International Conference on Learning Representations. [S.l.:s.n.], 2019.

[23] ALJUNDI R, LIN M, GOUJAUD B, et al. Online continual learning with no task boundaries[J]. arXiv preprint, 2019, arXiv:1903.08671.

[24] SILVER D L, MERCER R E. The task rehearsal method of life-long learning: overcoming impoverished data[M]// Advances in artificial intelligence. Heidelberg: Springer Berlin Heidelberg, 2002: 90-101.

[25] RANNEN A, ALJUNDI R, BLASCHKO M B, et al. Encoder based lifelong learning[C]// Proceedings of 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 1329-1337.

[26] KIRKPATRICK J, PASCANU R, RABINOWITZ N, et al. Overcoming catastrophic forgetting in neural networks[J]. Proceedings of the National Academy of Sciences of the United States of America, 2017, 114(13): 3521-3526.

[27] NGUYEN C V, LI Y Z, BUI T D, et al. Variational continual learning[J]. arXiv preprint, 2017, arXiv:1710.10628.

[28] ALJUNDI R, CHAKRAVARTY P, TUYTELAARS T. Expert Gate: lifelong learning with a network of experts[C]// Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 7120-7129.

[29] BIESIALSKA M, BIESIALSKA K, COSTA-JUSSÀ M R. Continual lifelong learning in natural language processing: a survey[C]// Proceedings of the 28th International Conference on Computational Linguistics. [S.l.]: International Committee on Computational Linguistics, 2020: 6523-6541.

[30] SUN Y, WANG S H, LI Y K, et al. ERNIE 2.0: a continual pre-training framework for language understanding[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 8968-8975.

[31] WANG R Z, TANG D Y, DUAN N, et al. K-Adapter: infusing knowledge into pre-trained models with adapters[C]// Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Stroudsburg: Association for Computational Linguistics, 2021.

[32] MA J Q, ZHAO Z, YI X Y, et al. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts[C]// Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York: ACM Press, 2018: 1930-1939.

[33] BENGIO Y, LOURADOUR J, COLLOBERT R, et al. Curriculum learning[C]// Proceedings of the 26th Annual International Conference on Machine Learning. New York: ACM Press, 2009: 41-48.

[34] DU H F, WANG H F, SHI Y H, et al. Progress, challenges and research trends of reasoning in multi-hop knowledge graph based question answering[J]. Big Data Research, 2021, 7(3): 60-79.

[35] CUI Y, LIU T, CHEN Z, et al. Dataset for the first evaluation on Chinese machine reading comprehension[C]// Proceedings of the 11th International Conference on Language Resources and Evaluation. [S.l.:s.n.], 2018: 2721-2725.

[36] XU L, HU H, ZHANG X W, et al. CLUE: a Chinese language understanding evaluation benchmark[C]// Proceedings of the 28th International Conference on Computational Linguistics. [S.l.]: International Committee on Computational Linguistics, 2020: 4762-4772.

[37] XU L, DONG Q Q, YU C, et al. CLUENER2020: fine-grained named entity recognition for Chinese[J]. arXiv preprint, 2020, arXiv:2001.04351.

[38] KANG X M, ZONG C Q. Fusion of discourse structural position encoding for neural machine translation[J]. Chinese Journal of Intelligent Science and Technology, 2020, 2(2): 144-152.

[39] CHEN T Q, GOODFELLOW I, SHLENS J. Net2Net: accelerating learning via knowledge transfer[J]. arXiv preprint, 2015, arXiv:1511.05641.

[40] KAPLAN J, MCCANDLISH S, HENIGHAN T, et al. Scaling laws for neural language models[J]. arXiv preprint, 2020, arXiv:2001.08361.

[41] SHAZEER N, MIRHOSEINI A, MAZIARZ K, et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer[J]. arXiv preprint, 2017, arXiv:1701.06538.

[42] ESCOLANO C, COSTA-JUSSÀ M R, FONOLLOSA J A R, et al. Multilingual machine translation: closing the gap between shared and language-specific encoder-decoders[C]// Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2021: 944-948.