End-to-end speech enhancement based on ultra-lightweight channel attention

doi:10.11959/j.issn.2096-6652.202136

Abstract

The fully convolutional time-domain audio separation network (Conv-TasNet) is a recently proposed, state-of-the-art end-to-end speech separation model. Conv-TasNet uses dilated convolutions to expand the receptive field and fuse more speech features spatially, which greatly improves its separation performance, but it ignores the importance of information across different convolution channels. This paper proposes an end-to-end speech enhancement method based on ultra-lightweight channel attention, which combines Conv-TasNet with channel attention. In addition, a group of filters is added to the Conv-TasNet encoder and decoder to improve the network's ability to extract speech features. The method lets the convolutional neural network combine spatial information and channel information more effectively, which improves speech enhancement. Experiments show that the proposed model improves speech enhancement performance while increasing model capacity by only about 0.02%.
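The abstract does not spell out the attention module, but the result tables name ECA-Net, so the idea can be illustrated with a minimal ECA-style sketch. It assumes a feature map stored as one row per channel; the function name `eca_channel_attention`, the toy input, and the kernel values are hypothetical and are not taken from the paper. The point it shows is why such a module is ultra-lightweight: its only learnable parameters are the `k` kernel weights shared across all channels.

```python
import math


def eca_channel_attention(features, kernel):
    """Apply ECA-style channel attention to a (channels x time) feature map.

    features: list of C rows, each a list of T floats (one row per channel).
    kernel:   list of k floats (k odd), shared by all channels; these are
              the module's only learnable parameters.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    num_channels = len(features)
    pad = k // 2

    # 1) Squeeze: global average pooling over the time axis.
    pooled = [sum(row) / len(row) for row in features]

    # 2) Local cross-channel interaction: a 1-D convolution (computed as
    #    cross-correlation, as in deep learning frameworks) over the channel
    #    axis, zero-padded so there is one output per channel.
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [
        sum(kernel[j] * padded[c + j] for j in range(k))
        for c in range(num_channels)
    ]

    # 3) Gate: the sigmoid maps each value to a weight in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-m)) for m in mixed]

    # 4) Rescale every channel by its attention weight.
    scaled = [[w * v for v in row] for w, row in zip(weights, features)]
    return scaled, weights


# Hypothetical toy input: 3 channels, 2 time frames.
features = [[1.0, 3.0], [0.0, 0.0], [-2.0, -2.0]]
scaled, weights = eca_channel_attention(features, kernel=[0.0, 1.0, 0.0])
```

With the identity-like kernel used here, each channel's weight depends only on its own mean, so the channel with the largest mean receives the largest weight. A trained kernel would instead mix each channel with its `k - 1` neighbours.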

Key words: speech enhancement, end-to-end speech separation network, channel attention

CLC Number: TP183

Yi HONG, Chengli SUN, Yan LENG. End-to-end speech enhancement based on ultra-lightweight channel attention[J]. Chinese Journal of Intelligent Science and Technology, 2021, 3(3): 351-358.

Method	SDR/dB	SI-SDR/dB	PESQ	STOI	Parameters/M
Conv-TasNet	14.668	13.895	2.835	0.916	4.984
Conv-TasNet+SENet	14.700	13.959	2.826	0.916	5.773
Conv-TasNet+ECA-Net	14.827	14.094	2.870	0.918	4.985
Conv-TasNet+Deep-ECA-Net	14.888	14.095	2.903	0.920	6.173
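The parameter column of the comparison table above can be turned into a relative overhead against the Conv-TasNet baseline, which is where the abstract's figure of about 0.02% comes from. The short calculation below is only a sketch: the counts are copied from the table, where they are rounded to three decimals, so the ECA-Net result is approximate.

```python
# Parameter counts in millions, copied from the comparison table above.
params_m = {
    "Conv-TasNet": 4.984,
    "Conv-TasNet+SENet": 5.773,
    "Conv-TasNet+ECA-Net": 4.985,
    "Conv-TasNet+Deep-ECA-Net": 6.173,
}

baseline = params_m["Conv-TasNet"]

# Relative increase in model size over the baseline, in percent.
overhead_pct = {
    name: 100.0 * (count - baseline) / baseline
    for name, count in params_m.items()
}

for name, pct in overhead_pct.items():
    print(f"{name}: {pct:+.2f}%")
```

On these rounded figures, ECA-Net adds about 0.02% to the baseline, SENet about 15.8%, and Deep-ECA-Net about 23.9%.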

Method	SDR/dB	SI-SDR/dB	PESQ	STOI
Conv-TasNet	14.668	13.895	2.835	0.916
ECA-Net1	14.667	13.898	2.836	0.916
ECA-Net2	14.827	14.094	2.870	0.918
ECA-Net3	14.812	14.030	2.866	0.917
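For readers unfamiliar with the SI-SDR column, the sketch below computes the metric under its commonly used scale-invariant definition: both signals are made zero-mean, the estimate is projected onto the reference, and the ratio of projected energy to residual energy is reported in dB. This is an assumption about how the metric is defined in general, not the paper's evaluation code, and the function name `si_sdr` and the toy signals are hypothetical.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean of each signal.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]

    # Project the estimate onto the reference to get the target component.
    alpha = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [alpha * r for r in ref]

    # Everything that is not explained by the reference counts as distortion.
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


# Hypothetical toy signals: the estimate is the reference plus a small error.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]

score = si_sdr(estimate, reference)
rescaled_score = si_sdr([3.0 * v for v in estimate], reference)
```

Rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR. The sketch does not guard against a perfect estimate (zero residual energy) or an all-zero reference.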


Subset	Duration/h	Average duration per speaker/min	Female speakers	Male speakers	Total speakers
Validation set	5.4	8	20	20	40
Test set	5.4	8	20	20	40
Training set	100.6	25	125	126	251
