End-to-end speech enhancement based on ultra-lightweight channel attention

Chinese Journal of Intelligent Science and Technology, 2021, Vol. 3, Issue 3: 351-358. doi: 10.11959/j.issn.2096-6652.202136

Special issue: Intelligent Target Detection and Recognition

Yi HONG (洪依)¹, Chengli SUN (孙成立)¹, Yan LENG (冷严)²

¹ School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
² School of Physics and Electronics, Shandong Normal University, Jinan 250014, China

Revised: 2021-07-17 Online: 2021-09-15 Published: 2021-09-01
About the authors:
Yi HONG (1997- ), female, is a master's student at the School of Information Engineering, Nanchang Hangkong University. Her main research interests include signal processing, speech enhancement, and echo cancellation.
Chengli SUN (1975- ), male, postdoctoral researcher, is a professor at the School of Information Engineering, Nanchang Hangkong University. His main research interests include artificial intelligence, speech signal processing, speech recognition, and speech enhancement.
Yan LENG (1982- ), female, Ph.D., is an associate professor at the School of Physics and Electronics, Shandong Normal University. Her main research interests include audio information processing and audio classification and detection.
Supported by:
The National Natural Science Foundation of China (61861033); the Key Project of the Natural Science Foundation of Jiangxi Province (20202ACBL202007); the Natural Science Foundation of Shandong Province (ZR2020MF020)

Abstract

The fully convolutional time-domain audio separation network (Conv-TasNet) is a state-of-the-art end-to-end speech separation model proposed in recent years. Conv-TasNet uses dilated convolution to enlarge the receptive field so that more speech features can be fused along the spatial dimension, which greatly improves the separation performance of the network. However, it ignores how the importance of information differs across convolution channels. To address this, an end-to-end speech enhancement method based on ultra-lightweight channel attention is proposed. The method combines Conv-TasNet with channel attention and adds a group of filters to the Conv-TasNet encoder and decoder to strengthen the network's speech feature extraction, so that the convolutional neural network can combine spatial information and channel information more effectively. Experiments show that the proposed method improves speech enhancement performance while increasing the model capacity by only about 0.02%.

Key words: speech enhancement, end-to-end speech separation network, channel attention
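
The channel attention named in the abstract can be illustrated with a small sketch. The result tables on this page identify ECA-Net as the lightweight variant, so the pure-Python code below follows the published ECA-Net recipe: global average pooling, a short 1D convolution across neighbouring channels, and a sigmoid gate. It is an illustration under stated assumptions, not the authors' implementation: the function names `eca_kernel_size` and `eca_1d`, the zero padding at the channel edges, and the channels-by-time input layout are hypothetical choices, and the abstract does not say where the module sits inside Conv-TasNet.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size from the ECA-Net paper (defaults gamma=2, b=1)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca_1d(features, weights):
    """Apply ECA-style channel attention to a channels-by-time feature map.

    features: list of C channels, each a list of T samples.
    weights:  odd-length list of 1D convolution weights shared by all channels
              (these would be learned; here they are passed in by the caller).
    """
    pad = len(weights) // 2
    # Squeeze: global average pooling over time gives one value per channel.
    pooled = [sum(channel) / len(channel) for channel in features]
    # Assumption: zero padding so every channel sees a full window.
    padded = [0.0] * pad + pooled + [0.0] * pad
    gates = []
    for c in range(len(pooled)):
        # Local cross-channel interaction over a window of neighbouring channels.
        z = sum(w * padded[c + j] for j, w in enumerate(weights))
        gates.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid gate in (0, 1)
    # Excite: rescale every sample of a channel by that channel's gate.
    return [[g * x for x in channel] for g, channel in zip(gates, features)]
```

The only learnable parameters in such a module are the shared convolution weights, so each module adds a handful of parameters regardless of the channel count. That is consistent with the very small capacity increase reported in the abstract.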

CLC number: TP183

Yi HONG, Chengli SUN, Yan LENG. End-to-end speech enhancement based on ultra-lightweight channel attention[J]. Chinese Journal of Intelligent Science and Technology, 2021, 3(3): 351-358.

Figures/Tables (12)

Figures 1-7

Table 1

Table 2. Performance of different channel attention networks added at the same position in Conv-TasNet

Method	SDR/dB	SI-SDR/dB	PESQ	STOI	Parameters/M
Conv-TasNet	14.668	13.895	2.835	0.916	4.984
Conv-TasNet+SENet	14.700	13.959	2.826	0.916	5.773
Conv-TasNet+ECA-Net	14.827	14.094	2.870	0.918	4.985
Conv-TasNet+Deep-ECA-Net	14.888	14.095	2.903	0.920	6.173
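
The parameter column of the table above can be turned into relative overheads with a few lines of arithmetic. The sketch below copies the four parameter counts from the table and computes the percentage increase over the Conv-TasNet baseline; the names `params_m` and `overhead_percent` are hypothetical. Because the table rounds to three decimals, the results are approximate, and the ECA-Net row works out to about 0.02%, matching the figure quoted in the abstract.

```python
# Parameter counts in millions, copied from the table above.
params_m = {
    "Conv-TasNet": 4.984,
    "Conv-TasNet+SENet": 5.773,
    "Conv-TasNet+ECA-Net": 4.985,
    "Conv-TasNet+Deep-ECA-Net": 6.173,
}


def overhead_percent(params, baseline="Conv-TasNet"):
    """Relative parameter increase over the baseline, in percent."""
    base = params[baseline]
    return {
        name: 100.0 * (count - base) / base
        for name, count in params.items()
        if name != baseline
    }


overhead = overhead_percent(params_m)
for name, pct in overhead.items():
    print(f"{name}: +{pct:.2f}%")
```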

Figures 8-9

Table 3. Network performance with ECA-Net at different positions in Conv-TasNet

Method	SDR/dB	SI-SDR/dB	PESQ	STOI
Conv-TasNet	14.668	13.895	2.835	0.916
ECA-Net1	14.667	13.898	2.836	0.916
ECA-Net2	14.827	14.094	2.870	0.918
ECA-Net3	14.812	14.030	2.866	0.917
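
For readers unfamiliar with the SI-SDR column, the following pure-Python sketch shows the usual scale-invariant SDR definition: project the estimate onto the reference to get the target component, treat the remainder as distortion, and report the energy ratio in dB. This page does not describe the authors' evaluation code, so the function name `si_sdr`, the `eps` stabiliser, and the choice not to remove the signal mean beforehand are assumptions.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    # Everything in the estimate that is not the scaled reference.
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 1.0, -1.0, -1.0]
print(si_sdr(estimate, reference))
```

In this toy input the target and residual components carry equal energy, so the value is close to 0 dB, and multiplying `estimate` by any non-zero constant leaves it essentially unchanged, which is the scale invariance the metric is named for.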

Subset	Duration/h	Average duration per speaker/min	Female speakers	Male speakers	Total speakers
Validation set	5.4	8	20	20	40
Test set	5.4	8	20	20	40
Training set	100.6	25	125	126	251
