通信学报 ›› 2024, Vol. 45 ›› Issue (4): 65-72. doi: 10.11959/j.issn.1000-436x.2024066

• Academic Paper •

Research on self-training neural machine translation based on monolingual priority sampling

Xiaoyan ZHANG, Lei PANG, Xiaofeng DU, Tianbo LU, Yamei XIA

  1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2024-01-18 Revised: 2024-03-11 Online: 2024-04-30 Published: 2024-05-27
  • Corresponding author: Xiaofeng DU, E-mail: dxf@bupt.edu.cn
  • About the authors:
    Xiaoyan ZHANG (1973- ), female, born in Yantai, Shandong, Ph.D., is a professor at Beijing University of Posts and Telecommunications. Her main research interests include software engineering theory, mobile Internet software, and big data analysis.
    Lei PANG (1999- ), male, born in Qingdao, Shandong, is a master's student at Beijing University of Posts and Telecommunications. His main research interests include natural language processing and machine translation.
    Xiaofeng DU (1973- ), male, born in Hancheng, Shaanxi, is a lecturer at Beijing University of Posts and Telecommunications. His main research interests include cloud computing and big data analysis.
    Tianbo LU (1977- ), male, born in Bijie, Guizhou, Ph.D., is a professor at Beijing University of Posts and Telecommunications. His main research interests include network and information security, secure software engineering, and P2P computing.
    Yamei XIA (1976- ), female, born in Tianshui, Gansu, Ph.D., is an associate professor at Beijing University of Posts and Telecommunications. Her main research interests include network security and data mining.
  • Supported by:
    The National Natural Science Foundation of China (62162060)

Abstract:

To enhance the performance of neural machine translation (NMT) and ameliorate the detrimental impact of high uncertainty in monolingual data during the self-training process, a self-training NMT model based on priority sampling was proposed. Initially, syntactic dependency trees were constructed and the importance of monolingual tokenization was assessed using grammar dependency analysis. Subsequently, a monolingual lexicon was built, and priority was defined based on the importance of monolingual tokenization and uncertainty. Finally, monolingual priorities were computed, and sampling was carried out based on these priorities, consequently generating a synthetic parallel dataset for training the student NMT model. Experimental results on a large-scale subset of the WMT English to German dataset demonstrate that the proposed model effectively enhances NMT translation performance and mitigates the impact of high uncertainty on the model.
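The sampling pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual method: the linear priority formula, the `alpha` weight, and all function names are assumptions introduced here, and the per-token importance and uncertainty scores are taken as given (the paper derives importance from a syntactic dependency tree).

```python
import random

def sentence_priority(importances, uncertainties, alpha=0.5):
    """Combine per-token importance (e.g., derived from a dependency tree)
    with per-token model uncertainty into one sentence-level priority.
    The linear mix and alpha=0.5 are illustrative assumptions."""
    avg_imp = sum(importances) / len(importances)
    avg_unc = sum(uncertainties) / len(uncertainties)
    # High importance raises priority; high uncertainty lowers it, so
    # overly uncertain monolingual sentences are sampled less often.
    return alpha * avg_imp + (1 - alpha) * (1.0 - avg_unc)

def sample_monolingual(sentences, priorities, k, seed=0):
    """Draw k monolingual sentences with probability proportional
    to their priority scores."""
    rng = random.Random(seed)
    return rng.choices(sentences, weights=priorities, k=k)

# Toy example: three monolingual sentences with made-up token scores.
sents = ["the cat sat", "colorless green ideas", "time flies fast"]
prios = [
    sentence_priority([0.9, 0.8, 0.7], [0.1, 0.2, 0.1]),
    sentence_priority([0.3, 0.2, 0.4], [0.9, 0.8, 0.9]),
    sentence_priority([0.6, 0.5, 0.6], [0.4, 0.5, 0.4]),
]
selected = sample_monolingual(sents, prios, k=2)
# The sampled sentences would then be translated by the teacher model to
# synthesize the parallel data used to train the student NMT model.
```

In this sketch the first sentence (important, low-uncertainty tokens) receives a much higher priority than the second (unimportant, high-uncertainty tokens), so it dominates the sampling, which is the intended effect of the priority definition.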

Key words: machine translation, data augmentation, self-training, uncertainty, syntactic dependency

CLC number:
