[1] |
CHEN H J. Low-resource speech representation learning and its applications[D]. Xi’an: Northwestern Polytechnical University, 2018. (in Chinese)
|
[2] |
ZHU Y. Research on deep learning-based representation learning algorithms[D]. Hefei: Hefei University of Technology, 2018. (in Chinese)
|
[3] |
LIU X P, ZHANG W L. An overview of self-supervised speech representation learning[C]// Proceedings of the 16th National Conference on Man-Machine Speech Communication (2021). Beijing: Chinese Information Processing Society of China, 2021: 284-293. (in Chinese)
|
[4] |
YANG S W, CHI P H, CHUANG Y S, et al. SUPERB: speech processing universal performance benchmark[C]// Proceedings of Interspeech 2021. Piscataway: IEEE Press, 2021: 1194-1198.
|
[5] |
OORD A V D, LI Y Z, VINYALS O. Representation learning with contrastive predictive coding[J]. arXiv Preprint, arXiv:1807.03748, 2018.
|
[6] |
SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. wav2vec: unsupervised pre-training for speech recognition[C]// Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 3465-3469.
|
[7] |
BAEVSKI A, SCHNEIDER S, AULI M. vq-wav2vec: self-supervised learning of discrete speech representations[J]. arXiv Preprint, arXiv:1910.05453, 2019.
|
[8] |
BAEVSKI A, ZHOU H, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[J]. Advances in Neural Information Processing Systems, 2020, 33: 12449-12460.
|
[9] |
GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: a new approach to self-supervised learning[J]. Advances in Neural Information Processing Systems, 2020, 33: 21271-21284.
|
[10] |
CHEN X L, HE K M. Exploring simple Siamese representation learning[C]// Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 15745-15753.
|
[11] |
HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
|
[12] |
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
|
[13] |
PANAYOTOV V, CHEN G G, POVEY D, et al. Librispeech: an ASR corpus based on public domain audio books[C]// Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 5206-5210.
|
[14] |
HSU W N, TSAI Y H H, BOLTE B, et al. HuBERT: how much can a bad teacher benefit ASR pre-training?[C]// Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2021: 6533-6537.
|
[15] |
CHEN S Y, WANG C Y, CHEN Z Y, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing[J]. arXiv Preprint, arXiv:2110.13900, 2021.
|
[16] |
CHUNG Y A, HSU W N, TANG H, et al. An unsupervised autoregressive model for speech representation learning[C]// Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 146-150.
|
[17] |
DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. [S.l.: s.n.], 2019: 4171-4186.
|
[18] |
CHUNG Y A, TANG H, GLASS J. Vector-quantized autoregressive predictive coding[C]// Proceedings of Interspeech 2020. Piscataway: IEEE Press, 2020: 3760-3764.
|
[19] |
LIU A T, YANG S W, CHI P H, et al. Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders[C]// Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2020: 6419-6423.
|
[20] |
LIU A T, LI S W, LEE H Y. TERA: self-supervised learning of transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.
|
[21] |
YUE X H, LI H Z. Phonetically motivated self-supervised speech representation learning[C]// Proceedings of Interspeech 2021. Piscataway: IEEE Press, 2021: 746-750.
|
[22] |
JIANG D W, LI W B, CAO M, et al. Speech SimCLR: combining contrastive and reconstruction objective for self-supervised speech representation learning[C]// Proceedings of Interspeech 2021. Piscataway: IEEE Press, 2021: 1544-1548.
|
[23] |
CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]// Proceedings of the 37th International Conference on Machine Learning. New York: PMLR, 2020: 1597-1607.
|
[24] |
ZAIEM S, PARCOLLET T, ESSID S. Pretext tasks selection for multitask self-supervised speech representation learning[J]. arXiv Preprint, arXiv:2107.00594, 2021.
|
[25] |
CHICCO D. Siamese neural networks: an overview[J]. Artificial Neural Networks, 2021: 73-94.
|
[26] |
HE K M, FAN H Q, WU Y X, et al. Momentum contrast for unsupervised visual representation learning[C]// Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2020: 9726-9735.
|
[27] |
PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: a simple data augmentation method for automatic speech recognition[C]// Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 2613-2617.
|
[28] |
GAROFOLO J S, LAMEL L F, FISHER W M, et al. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1[R]. NASA STI/Recon Technical Report N, 1993.
|
[29] |
GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition[C]// Proceedings of Interspeech 2020. Piscataway: IEEE Press, 2020: 5036-5040.
|
[30] |
WATANABE S, HORI T, KARITA S, et al. ESPnet: end-to-end speech processing toolkit[C]// Proceedings of Interspeech 2018. Piscataway: IEEE Press, 2018: 2207-2211.
|
[31] |
LUGOSCH L, RAVANELLI M, IGNOTO P, et al. Speech model pretraining for end-to-end spoken language understanding[C]// Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 814-818.
|
[32] |
NAGRANI A, CHUNG J S, XIE W D, et al. VoxCeleb: large-scale speaker verification in the wild[J]. Computer Speech & Language, 2020, 60: 101027.
|
[33] |
ANGUERA X, RODRIGUEZ-FUENTES L J, BUZO A, et al. QUESST2014: evaluating query-by-example speech search in a zero-resource setting with real-life queries[C]// Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 5833-5837.
|
[34] |
SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-vectors: robust DNN embeddings for speaker recognition[C]// Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2018: 5329-5333.
|