通信学报 (Journal on Communications), 2022, Vol. 43, Issue (7): 163-171. doi: 10.11959/j.issn.1000-436x.2022142

• Research Papers •

Self-supervised speech representation learning based on positive sample comparison and masking reconstruction

Wenlin ZHANG, Xuepeng LIU, Tong NIU, Qi CHEN, Dan QU

  1. College of Information System Engineering, Information Engineering University, Zhengzhou 450001, China
  • Revised: 2022-06-20  Online: 2022-07-25  Published: 2022-06-01
  • About the authors: Wenlin ZHANG (1982- ), born in Huanggang, Hubei, Ph.D., is an associate professor at Information Engineering University. His main research interests include speech signal processing, speech recognition, and machine learning.
    Xuepeng LIU (1996- ), born in Tai'an, Shandong, is a master's student at Information Engineering University. His main research interests include intelligent information processing, unsupervised learning, and speech representation learning.
    Tong NIU (1984- ), born in Anyang, Henan, Ph.D., is an associate professor at Information Engineering University. His main research interests include deep learning, speech signal processing, and speech recognition.
    Qi CHEN (1974- ), born in Zhengzhou, Henan, is an associate professor at Information Engineering University. His main research interests include speech signal processing, speech recognition, and audio watermarking.
    Dan QU (1974- ), born in Jiutai, Jilin, Ph.D., is a professor at Information Engineering University. Her main research interests include machine learning, deep learning, and speech recognition.
  • Supported by:
    The National Natural Science Foundation of China (61673395, 62171470)


Abstract:

Existing self-supervised speech representation learning methods based on contrastive prediction need to construct a large number of negative samples during training; their performance depends on large training batches and therefore consumes substantial computing resources. To address this problem, a speech contrastive learning method using only positive samples was proposed and combined with a masked reconstruction task, yielding a multi-task self-supervised speech representation learning method that improves representation quality while reducing training complexity. The positive-sample contrastive task borrows the idea of the SimSiam method from self-supervised image representation learning: using a siamese network architecture, two augmented views of the raw speech signal are generated and processed by the same encoder; one branch then passes through a feed-forward network while a stop-gradient strategy is applied to the other, and the model parameters are adjusted to maximize the similarity between the outputs of the two branches. Since no negative samples are constructed at any point during training, small batches can be used, greatly improving learning efficiency. Self-supervised representation learning was carried out on the LibriSpeech corpus, and the resulting model was fine-tuned and evaluated on a variety of downstream tasks. Comparative experiments show that the model obtained by the proposed method matches or exceeds the performance of existing mainstream speech representation learning models on multiple tasks.

Key words: speech representation, self-supervised learning, unsupervised learning, siamese network

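As a rough illustration of the training objective described in the abstract, the following PyTorch sketch combines a SimSiam-style positive-pair loss (a feed-forward predictor on one branch, stop-gradient on the other) with a masked-reconstruction loss. The encoder and predictor sizes, the noise-based augmentation, and the masking ratio are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechSimSiam(nn.Module):
    """Toy siamese model: shared encoder, one-branch predictor, frame decoder."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        # Shared encoder applied to both augmented views of the input.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Feed-forward prediction network applied to one branch only.
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Decoder used by the masked-reconstruction task.
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def contrastive_loss(self, view1, view2):
        # Symmetric negative cosine similarity; detach() is the stop-gradient
        # that prevents collapse without requiring any negative samples.
        z1, z2 = self.encoder(view1), self.encoder(view2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        def dist(p, z):
            return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
        return 0.5 * dist(p1, z2) + 0.5 * dist(p2, z1)

    def reconstruction_loss(self, masked_input, target, mask):
        # Reconstruct the original features, penalizing only masked frames.
        rec = self.decoder(self.encoder(masked_input))
        return ((rec - target).abs() * mask.unsqueeze(-1)).mean()

model = SpeechSimSiam()
x = torch.randn(8, 100, 80)                      # batch of speech features
aug = lambda t: t + 0.05 * torch.randn_like(t)   # stand-in data augmentation
mask = (torch.rand(8, 100) < 0.15).float()       # frames selected for masking
masked = x * (1.0 - mask).unsqueeze(-1)
loss = model.contrastive_loss(aug(x), aug(x)) + model.reconstruction_loss(masked, x, mask)
loss.backward()

Because the objective never compares against negatives, every example in a small batch contributes a valid training signal, which is what lets this style of method avoid the large-batch requirement of contrastive-prediction approaches.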

CLC number:
