Speech enhancement method based on multi-domain fusion and neural architecture search

Journal on Communications, 2024, Vol. 45, Issue (2): 225-239. doi: 10.11959/j.issn.1000-436x.2024018

• Academic Communication •

Rui ZHANG, Pengyun ZHANG, Chaoli SUN

College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China

Revised: 2023-12-19 Published: 2024-02-01 Online: 2024-02-01
About the authors: Rui ZHANG (1987- ), male, born in Taiyuan, Shanxi, Ph.D., associate professor and master's supervisor at Taiyuan University of Science and Technology. His main research interests include intelligent information processing and automated machine learning.
Pengyun ZHANG (1999- ), male, born in Anping, Hebei, master's student at Taiyuan University of Science and Technology. His main research interests include intelligent information processing and automated machine learning.
Chaoli SUN (1978- ), female, born in Zhuji, Zhejiang, Ph.D., professor and doctoral supervisor at Taiyuan University of Science and Technology. Her main research interests include computational intelligence and machine learning.
Supported by:
The National Natural Science Foundation of China (62372319); Humanities and Social Science Research Project of Ministry of Education (23YJCZH299); The Key Research and Development Project of Shanxi Province (202102020101002); Basic Research Project of Shanxi Province (20210302123216); Project of Graduate Joint Training Demonstration Base of Taiyuan University of Science and Technology (JD2022004); Graduate Education Innovation Project of Taiyuan University of Science and Technology (SY2023040)

Abstract

To further improve the self-learning and noise-reduction capabilities of speech enhancement models, a speech enhancement method based on multi-domain fusion and neural architecture search is proposed. A multi-spatial-domain mapping and fusion mechanism for speech signals is designed to mine the correlations between the real-valued and complex-valued representations of the signal. Building on the characteristics of the model's convolution and pooling operations, a complex neural architecture search mechanism is proposed, in which a purpose-designed search space, search strategy, and evaluation strategy construct the speech enhancement model efficiently and automatically. In comparison and generalization experiments against baseline models, the best model found by the search improves perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) by up to 5.6% over the best baseline model, while having the fewest parameters.

Key words: speech enhancement model, complex spatial domain mapping, multi-domain fusion, complex neural architecture search, low-cost evaluation

CLC number: TP18

Rui ZHANG, Pengyun ZHANG, Chaoli SUN. Speech enhancement method based on multi-domain fusion and neural architecture search[J]. Journal on Communications, 2024, 45(2): 225-239.

Figures/Tables 20


Table 2 Performance comparison of different strategies

Strategy type	Strategy	Search time/s	CIFAR-10 validation	CIFAR-10 test	CIFAR-100 validation	CIFAR-100 test	ImageNet16-120 validation	ImageNet16-120 test
Non-weight-sharing	REA	12000	91.19±0.31	93.92±0.30	71.81±1.12	71.84±0.99	45.15±0.89	45.54±1.03
Non-weight-sharing	RS	12000	90.93±0.36	93.70±0.36	70.93±1.09	71.04±1.07	44.45±1.10	44.57±1.25
Non-weight-sharing	REINFORCE	12000	91.09±0.37	93.85±0.37	71.61±1.12	71.71±1.09	45.05±1.02	45.24±1.18
Non-weight-sharing	BOHB	12000	90.82±0.53	93.61±0.52	70.74±1.29	70.85±1.28	44.26±1.36	44.42±1.49
Weight-sharing	RSPS	7587	84.16±1.69	87.66±1.69	59.00±4.60	58.33±4.34	31.56±3.28	31.14±3.88
Weight-sharing	DARTS-V1	10890	39.77±0.00	54.30±0.00	15.03±0.00	15.61±0.00	16.43±0.00	16.32±0.00
Weight-sharing	DARTS-V2	29902	39.77±0.00	54.30±0.00	15.03±0.00	15.61±0.00	16.43±0.00	16.32±0.00
Weight-sharing	GDAS	28926	90.00±0.21	93.51±0.13	71.14±0.27	70.61±0.26	41.70±1.26	41.84±0.90
Weight-sharing	SETN	31010	82.25±5.17	86.19±4.63	56.86±7.59	56.87±7.77	32.54±3.63	31.90±4.07
Weight-sharing	ENAS	13315	39.77±0.00	54.30±0.00	15.03±0.00	15.61±0.00	16.43±0.00	16.32±0.00
Low-cost	I (proposed strategy and NAS-WOT)	-	0.0012		0.0190		0.0360
Low-cost	NAS-WOT (N=10)	3.6	89.16±1.56	91.40±1.13	69.26±2.25	69.10±2.06	41.98±4.01	41.20±4.11
Low-cost	Proposed strategy (N=10)	3.1	89.86±0.12	91.62±0.30	69.77±1.45	69.11±1.72	41.77±3.33	41.30±4.27
Low-cost	NAS-WOT (N=100)	30.9	89.51±0.78	91.31±1.12	68.13±1.05	69.18±1.41	42.33±3.23	42.48±3.01
Low-cost	Proposed strategy (N=100)	28.6	89.73±1.16	91.64±0.17	68.28±2.01	69.19±0.82	39.99±3.25	41.52±4.45
Low-cost	NAS-WOT (N=500)	130.2	88.90±0.61	91.61±1.07	68.52±1.22	68.04±1.41	39.69±2.05	39.77±2.10
Low-cost	Proposed strategy (N=500)	110.6	88.96±0.35	92.19±1.44	69.03±0.72	69.46±0.71	40.93±2.39	42.15±2.11
Low-cost	NAS-WOT (N=1000)	310.3	89.63±0.73	91.30±0.81	68.77±1.21	68.58±1.22	39.21±2.12	39.12±1.78
Low-cost	Proposed strategy (N=1000)	256.2	89.97±1.85	92.33±1.01	69.68±0.92	69.56±0.93	41.73±2.03	42.77±1.98
Low-cost	NAS-WOT (N=2000)	601.5	89.90±1.44	91.33±0.99	69.33±1.41	69.98±2.22	40.21±2.11	40.32±3.08
Low-cost	Proposed strategy (N=2000)	505.9	91.09±2.15	93.95±1.33	69.89±1.48	71.99±1.88	42.53±3.13	43.95±2.53
Optimal value	Optimal (N=10)	-	90.11±0.75	93.40±0.49	70.13±1.98	70.13±1.98	44.77±1.77	44.77±1.77
Optimal value	Optimal (N=100)	-	91.11±0.12	94.02±0.11	72.81±0.90	72.81±0.90	46.01±0.47	46.01±0.47
Optimal value	Optimal (N=500)	-	91.14±0.17	94.10±0.22	72.91±0.64	72.91±0.64	46.02±0.73	46.02±0.73
Optimal value	Optimal (N=1000)	-	91.32±0.11	94.20±0.14	72.93±0.41	72.93±0.41	46.62±0.57	46.62±0.57
Optimal value	Optimal (N=2000)	-	91.36±1.07	94.25±1.08	72.95±0.47	72.95±0.47	46.68±0.45	46.68±0.45
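The search-time column of Table 2 can be turned into a relative saving of the proposed low-cost strategy over NAS-WOT at each sample count N. The short sketch below only redoes that arithmetic on the figures copied from the table; the helper name `time_reduction_percent` is introduced here for illustration and does not come from the paper.

```python
# Search times in seconds, copied from the low-cost rows of Table 2.
# Each entry maps N to (NAS-WOT time, proposed-strategy time).
SEARCH_TIME_S = {
    10: (3.6, 3.1),
    100: (30.9, 28.6),
    500: (130.2, 110.6),
    1000: (310.3, 256.2),
    2000: (601.5, 505.9),
}


def time_reduction_percent(baseline, proposed):
    """Relative reduction in search time, as a percentage of the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical helper output: percentage saved for every N in the table.
reductions = {
    n: time_reduction_percent(base, prop)
    for n, (base, prop) in SEARCH_TIME_S.items()
}

for n, saved in reductions.items():
    print(f"N={n}: {saved:.1f}% less search time than NAS-WOT")
```

On these figures the saving ranges from roughly 7% (N=100) to roughly 17% (N=1000); the table itself does not report such percentages, so they should be read as derived values only.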



Layer	Input size	Output size
reshape_1	T × 257	1 × T × 257
Complex downsampling Cell_1	1 × T × 257	16 × T × 129
Complex downsampling Cell_2	16 × T × 129	32 × T × 65
Complex downsampling Cell_3	32 × T × 65	64 × T × 33
Complex downsampling Cell_4	64 × T × 33	128 × T × 17
Complex downsampling Cell_5	128 × T × 17	256 × T × 9
reshape_2	256 × T × 9	T × 2304
Complex LSTM and other modules	T × 2304	T × 2304
reshape_3	T × 2304	256 × T × 9
Complex upsampling Cell_5	512 × T × 9	128 × T × 17
Complex upsampling Cell_4	256 × T × 17	64 × T × 33
Complex upsampling Cell_3	128 × T × 33	32 × T × 65
Complex upsampling Cell_2	64 × T × 65	16 × T × 129
Complex upsampling Cell_1	32 × T × 129	1 × T × 257
reshape_4	1 × T × 257	T × 257
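The sizes in the layer table follow a regular pattern that can be checked with a few lines of arithmetic. The sketch below assumes, without confirmation from the source, that each downsampling cell halves the frequency axis with rounding up and that each upsampling cell receives its input concatenated with the matching encoder output (which would explain the doubled input channels). The function names are illustrative only.

```python
def encoder_shapes(freq_bins=257, channels=(16, 32, 64, 128, 256)):
    """Return (channels, frequency bins) after each downsampling cell.

    The time axis T is left out because the table keeps it unchanged.
    """
    shapes = []
    f = freq_bins
    for c in channels:
        f = (f + 1) // 2  # assumed: stride-2 halving, rounded up
        shapes.append((c, f))
    return shapes


def decoder_input_channels(shapes):
    """Input channels of upsampling Cell_5 ... Cell_1.

    Assumes a skip connection concatenates each decoder input with the
    encoder output of the same level, doubling the channel count.
    """
    return [2 * c for c, _ in reversed(shapes)]


shapes = encoder_shapes()
last_channels, last_freq = shapes[-1]
flattened = last_channels * last_freq  # feature size fed to the LSTM stage

print(shapes)
print(flattened)
print(decoder_input_channels(shapes))
```

Under these assumptions the frequency axis goes 257, 129, 65, 33, 17, 9, the flattened feature size is 256 × 9 = 2304, and the decoder input channels are 512, 256, 128, 64, 32, all of which agree with the table.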

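Parameter	Value
Population size	20
Number of evolution iterations	20
Crossover probability	0.5
Mutation probability	0.5
Global perception iterations	5
Local refinement iterations	5
Evaluation mini-batch size	128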
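To show how the first four settings in the parameter table typically interact, here is a toy evolutionary loop. It is a generic sketch, not the paper's search strategy: the bit-string genome, the `toy_fitness` score, single-point crossover, single-bit mutation, and elitism are all assumptions made for illustration, and the global perception and local refinement steps are not modeled because this page does not describe them.

```python
import random

# Values taken from the parameter table; everything else is illustrative.
POP_SIZE = 20
GENERATIONS = 20
P_CROSSOVER = 0.5
P_MUTATION = 0.5


def toy_fitness(genome):
    """Placeholder score (number of 1 bits), standing in for a real evaluator."""
    return sum(genome)


def evolve(genome_len=16, seed=0):
    """Run the toy search and return the best fitness after each generation."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_len)] for _ in range(POP_SIZE)
    ]
    history = [max(map(toy_fitness, population))]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        parents = ranked[: POP_SIZE // 2]  # keep the better half as parents
        children = [ranked[0][:]]  # elitism: best genome survives unchanged
        while len(children) < POP_SIZE:
            a, b = rng.sample(parents, 2)
            child = a[:]
            if rng.random() < P_CROSSOVER:
                cut = rng.randrange(1, genome_len)  # single-point crossover
                child = a[:cut] + b[cut:]
            if rng.random() < P_MUTATION:
                child[rng.randrange(genome_len)] ^= 1  # flip one bit
            children.append(child)
        population = children
        history.append(max(map(toy_fitness, population)))
    return history


history = evolve()
print(history)
```

With a population of 20 and 20 iterations, a loop of this shape scores at most 20 × 21 = 420 candidates including the initial population, which is why a cheap per-candidate evaluation matters for the overall search cost.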

Model	PESQ (0 dB)	STOI (0 dB)	PESQ (5 dB)	STOI (5 dB)	PESQ (10 dB)	STOI (10 dB)	PESQ (15 dB)	STOI (15 dB)	PESQ (20 dB)	STOI (20 dB)	Parameters	Auxiliary domain
Noisy	2.07	0.71	2.11	0.73	2.23	0.77	2.30	0.80	2.39	0.81	-	-
CRN	2.42	0.74	2.45	0.77	2.50	0.80	2.58	0.81	2.64	0.86	6.1×10⁶	-
LSTM	2.41	0.73	2.40	0.75	2.49	0.79	2.59	0.81	2.62	0.83	9.6×10⁶	-
DCUNet	2.45	0.74	2.48	0.79	2.52	0.81	2.61	0.83	2.70	0.88	3.6×10⁶	-
ConvTasNet	2.42	0.74	2.46	0.78	2.51	0.80	2.60	0.80	2.66	0.86	5.1×10⁶	-
DCCRN	2.56	0.78	2.57	0.83	2.61	0.84	2.69	0.87	2.70	0.91	3.7×10⁶	-
AMDCCRN	2.74	0.81	2.76	0.85	2.85	0.90	2.96	0.92	3.17	0.94	3.6×10⁶	GASF
Proposed method	2.75	0.82	2.77	0.88	2.86	0.92	3.02	0.93	3.20	0.95	3.5×10⁶	GASF
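The auxiliary-domain column lists GASF and GADF without expanding the abbreviations. Assuming they denote the Gramian angular summation and difference fields commonly used to map a 1-D sequence onto a 2-D image, the mapping can be sketched as follows. How the paper applies it to speech frames is not stated on this page, so the input here is just an arbitrary short sequence and the function names are illustrative.

```python
import math


def rescale(series):
    """Linearly rescale a sequence to [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)  # constant input: map everything to 0
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists.

    GASF[i][j] = cos(phi_i + phi_j) and GADF[i][j] = sin(phi_i - phi_j),
    where phi_i = arccos(x_i) for the rescaled sequence x.
    """
    x = rescale(series)
    # Clamp against floating-point drift just outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


gasf, gadf = gramian_angular_fields([0.0, 1.0, 2.0])
for row in gasf:
    print([round(v, 3) for v in row])
```

The summation field is symmetric and the difference field is antisymmetric with a zero diagonal, so the two images encode complementary pairwise relations between samples of the same sequence.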

Model	PESQ (0 dB)	STOI (0 dB)	PESQ (5 dB)	STOI (5 dB)	PESQ (10 dB)	STOI (10 dB)	PESQ (15 dB)	STOI (15 dB)	PESQ (20 dB)	STOI (20 dB)	Parameters	Auxiliary domain
Noisy	2.08	0.72	2.12	0.72	2.25	0.76	2.31	0.80	2.42	0.81	-	-
CRN	2.43	0.76	2.47	0.79	2.52	0.83	2.60	0.84	2.67	0.90	6.1×10⁶	-
LSTM	2.42	0.76	2.41	0.77	2.50	0.82	2.59	0.84	2.63	0.86	9.6×10⁶	-
DCUNet	2.45	0.75	2.47	0.82	2.53	0.84	2.60	0.85	2.72	0.90	3.6×10⁶	-
ConvTasNet	2.43	0.74	2.47	0.80	2.52	0.83	2.60	0.84	2.69	0.91	5.1×10⁶	-
DCCRN	2.55	0.79	2.56	0.86	2.61	0.87	2.67	0.90	2.69	0.92	3.7×10⁶	-
AMDCCRN	2.75	0.82	2.75	0.87	2.86	0.91	2.97	0.94	3.18	0.95	3.7×10⁶	GASF
Proposed method	2.75	0.81	2.77	0.85	2.89	0.93	3.01	0.95	3.21	0.95	3.6×10⁶	GASF

Model	PESQ (0 dB)	STOI (0 dB)	PESQ (5 dB)	STOI (5 dB)	PESQ (10 dB)	STOI (10 dB)	PESQ (15 dB)	STOI (15 dB)	PESQ (20 dB)	STOI (20 dB)	Parameters	Auxiliary domain
Noisy	2.09	0.71	2.11	0.72	2.24	0.77	2.30	0.79	2.41	0.80	-	-
CRN	2.41	0.75	2.45	0.78	2.51	0.82	2.60	0.84	2.65	0.89	6.1×10⁶	-
LSTM	2.42	0.76	2.44	0.77	2.50	0.81	2.59	0.83	2.64	0.87	9.6×10⁶	-
DCUNet	2.44	0.75	2.47	0.83	2.52	0.85	2.61	0.87	2.73	0.92	3.6×10⁶	-
ConvTasNet	2.42	0.74	2.46	0.81	2.51	0.84	2.62	0.85	2.70	0.92	5.1×10⁶	-
DCCRN	2.54	0.79	2.55	0.86	2.62	0.88	2.67	0.90	2.75	0.93	3.7×10⁶	-
AMDCCRN	2.74	0.85	2.76	0.89	2.85	0.92	2.99	0.94	3.15	0.95	3.7×10⁶	GADF
Proposed method	2.74	0.86	2.78	0.90	3.01	0.94	3.05	0.94	3.17	0.95	3.6×10⁶	GADF
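The abstract's improvement of up to 5.6% over the best baseline can be cross-checked against the results tables. The sketch below computes the relative gain of the proposed method over AMDCCRN, the strongest baseline in these tables, using the rows of the last table (auxiliary domain GADF). Which table the abstract's figure is drawn from is not stated on this page, so this is one plausible reading rather than the authors' own calculation.

```python
SNR_DB = (0, 5, 10, 15, 20)

# (PESQ, STOI) per SNR level, copied from the last results table (GADF).
AMDCCRN = [(2.74, 0.85), (2.76, 0.89), (2.85, 0.92), (2.99, 0.94), (3.15, 0.95)]
PROPOSED = [(2.74, 0.86), (2.78, 0.90), (3.01, 0.94), (3.05, 0.94), (3.17, 0.95)]


def relative_gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gains = []
for snr, (pesq_old, stoi_old), (pesq_new, stoi_new) in zip(SNR_DB, AMDCCRN, PROPOSED):
    gains.append((relative_gain_percent(pesq_new, pesq_old), "PESQ", snr))
    gains.append((relative_gain_percent(stoi_new, stoi_old), "STOI", snr))

# Tuples compare on the gain first, so max() picks the largest improvement.
best_gain, best_metric, best_snr = max(gains)
print(f"largest gain: {best_gain:.1f}% ({best_metric} at {best_snr} dB)")
# prints "largest gain: 5.6% (PESQ at 10 dB)"
```

In this table the largest gain is in PESQ at 10 dB (2.85 to 3.01), which matches the 5.6% quoted in the abstract; the STOI gains in the same table stay below 3%.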
