通信学报 (Journal on Communications), 2022, Vol. 43, Issue (7): 163-171. doi: 10.11959/j.issn.1000-436x.2022142

• Research Papers •

Self-supervised speech representation learning based on positive sample comparison and masking reconstruction

Wenlin ZHANG, Xuepeng LIU, Tong NIU, Qi CHEN, Dan QU

  1. College of Information System Engineering, Information Engineering University, Zhengzhou 450001, China
  • Revised: 2022-06-20  Online: 2022-07-25  Published: 2022-06-01
  • About the authors: Wenlin ZHANG (1982- ), born in Huanggang, Hubei, Ph.D., is an associate professor at Information Engineering University. His main research interests include speech signal processing, speech recognition, and machine learning.
    Xuepeng LIU (1996- ), born in Tai'an, Shandong, is a master's student at Information Engineering University. His main research interests include intelligent information processing, unsupervised learning, and speech representation learning.
    Tong NIU (1984- ), born in Anyang, Henan, Ph.D., is an associate professor at Information Engineering University. His main research interests include deep learning, speech signal processing, and speech recognition.
    Qi CHEN (1974- ), born in Zhengzhou, Henan, is an associate professor at Information Engineering University. His main research interests include speech signal processing, speech recognition, and audio watermarking.
    Dan QU (1974- ), born in Jiutai, Jilin, Ph.D., is a professor at Information Engineering University. Her main research interests include machine learning, deep learning, and speech recognition.
  • Supported by:
    The National Natural Science Foundation of China (61673395, 62171470)


Abstract:

Existing self-supervised speech representation learning methods based on contrastive prediction need to construct a large number of negative samples during training; their performance depends on large training batches and therefore consumes substantial computing resources. To address this problem, a speech contrastive learning method using only positive samples was proposed and combined with a masked reconstruction task, yielding a multi-task self-supervised speech representation learning method that improves representation quality while reducing training complexity. The positive-sample contrastive task borrows the idea of the SimSiam method from self-supervised image representation learning: using a siamese network architecture, two augmented views of the raw speech signal are generated and processed by the same encoder; one branch then passes through a feed-forward network while a stop-gradient strategy is applied to the other, and the model parameters are adjusted to maximize the similarity between the outputs of the two branches. Since no negative samples are constructed at any point during training, small batches can be used, greatly improving learning efficiency. Self-supervised representation learning was carried out on the LibriSpeech corpus, and the resulting model was fine-tuned and evaluated on a variety of downstream tasks. Comparative experiments show that the model obtained by the proposed method matches or exceeds the performance of existing mainstream speech representation learning models on multiple tasks.

Key words: speech representation, self-supervised learning, unsupervised learning, siamese network

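As a rough illustration of the training objective described in the abstract, the following PyTorch sketch combines a SimSiam-style positive-pair loss (a feed-forward predictor on one branch, stop-gradient on the other) with a masked-reconstruction loss. The encoder and predictor sizes, the noise-based augmentation, and the masking ratio are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechSimSiam(nn.Module):
    """Toy siamese model: shared encoder, one-branch predictor, frame decoder."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        # Shared encoder applied to both augmented views of the input.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Feed-forward prediction network applied to one branch only.
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Decoder used by the masked-reconstruction task.
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def contrastive_loss(self, view1, view2):
        # Symmetric negative cosine similarity; detach() is the stop-gradient
        # that prevents collapse without requiring any negative samples.
        z1, z2 = self.encoder(view1), self.encoder(view2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        def dist(p, z):
            return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
        return 0.5 * dist(p1, z2) + 0.5 * dist(p2, z1)

    def reconstruction_loss(self, masked_input, target, mask):
        # Reconstruct the original features, penalizing only masked frames.
        rec = self.decoder(self.encoder(masked_input))
        return ((rec - target).abs() * mask.unsqueeze(-1)).mean()

model = SpeechSimSiam()
x = torch.randn(8, 100, 80)                      # batch of speech features
aug = lambda t: t + 0.05 * torch.randn_like(t)   # stand-in data augmentation
mask = (torch.rand(8, 100) < 0.15).float()       # frames selected for masking
masked = x * (1.0 - mask).unsqueeze(-1)
loss = model.contrastive_loss(aug(x), aug(x)) + model.reconstruction_loss(masked, x, mask)
loss.backward()

Because the objective never compares against negatives, every example in a small batch contributes a valid training signal, which is what lets this style of method avoid the large-batch requirement of contrastive-prediction approaches.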

CLC number:
