Any-to-any voice conversion using representation separation auto-encoder

doi:10.11959/j.issn.1000-436x.2024044

Abstract

Any-to-any voice conversion on non-parallel corpora performs poorly when the linguistic content of speech is hard to separate from the speaker's individual characteristics. To address this problem, a voice conversion method based on a representation separation auto-encoder, called RSAE-VC (representation separation auto-encoder voice conversion), is proposed. The method treats the speaker characteristics of a speech signal as time-invariant and the content information as time-varying. Instance normalization and an activation guidance layer in the encoder separate the two, and the decoder then combines the content information of the source speech with the speaker characteristics of the target speech to synthesize the converted speech. Experimental results show that, compared with the AGAIN-VC (activation guidance and adaptive instance normalization voice conversion) method, RSAE-VC reduces the Mel cepstral distance by 3.11% on average and the root mean square error of pitch frequency by 2.41%, and raises the MOS and ABX scores by 5.22% and 8.45%, respectively. In RSAE-VC, a self-content loss constrains the converted speech to preserve more of the content information, and a self-speaker loss separates the speaker characteristics from the speech more effectively, so that as little speaker information as possible remains in the content representation, which improves conversion performance.

Key words: voice conversion, representation separation, adaptive instance normalization, self-content loss, self-speaker loss
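
The separation idea described in the abstract can be illustrated with a minimal NumPy sketch. It assumes features shaped (channels, frames), treats the per-channel mean and standard deviation over time as the time-invariant speaker statistics, and uses adaptive instance normalization (named in the key words) to re-apply the target statistics to the normalized source content. The function names, the epsilon value, and the random toy data are illustrative assumptions, not the authors' implementation; the RSAE-VC encoder, activation guidance layer, decoder, and losses are not reproduced here.

```python
import numpy as np

EPS = 1e-5  # assumed small constant for numerical stability


def instance_norm(feat):
    """Split a (channels, frames) feature map into a normalized part
    and its per-channel statistics over the time axis."""
    mean = feat.mean(axis=1, keepdims=True)
    std = feat.std(axis=1, keepdims=True) + EPS
    return (feat - mean) / std, mean, std


def adaptive_instance_norm(content, mean, std):
    """Re-scale normalized content with another utterance's statistics."""
    return content * std + mean


# Toy stand-ins for source and target features (hypothetical data).
rng = np.random.default_rng(0)
source = rng.normal(loc=2.0, scale=3.0, size=(4, 50))
target = rng.normal(loc=-1.0, scale=0.5, size=(4, 30))

# Time-varying part of the source: statistics removed.
content, _, _ = instance_norm(source)
# Time-invariant part of the target: only the statistics are kept.
_, tgt_mean, tgt_std = instance_norm(target)

# Combine source content with target statistics.
converted = adaptive_instance_norm(content, tgt_mean, tgt_std)
print(converted.shape)
```

In this sketch the converted features keep the frame count of the source while their per-channel mean and spread follow the target, which is the sense in which content and speaker statistics are handled separately.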

CLC number: TP391.42

Zhihua JIAN, Zixu ZHANG. Any-to-any voice conversion using representation separation auto-encoder[J]. Journal on Communications, 2024, 45(2): 162-172.



λ₁	F2F	M2M	F2M	M2F
2.0	9.111	8.284	8.831	9.427
2.5	9.119	8.401	8.897	9.420
3.0	9.174	8.421	8.862	9.672
3.5	9.075	8.358	8.641	9.466
4.0	9.318	8.388	8.835	9.760

λ₂	F2F	M2M	F2M	M2F
0.5	9.105	8.194	8.664	9.576
0.6	9.143	8.125	8.517	9.545
0.7	9.142	8.301	8.679	9.590
1.0	9.233	8.437	8.580	9.536
1.5	9.252	8.467	8.835	9.484

Activation function	MCD (F2F)	MCD (M2M)	MCD (F2M)	MCD (M2F)	L_rec
None	11.304	9.120	11.112	11.372	0.152
tanh	11.840	11.064	11.029	11.976	0.203
ReLU	12.163	11.917	11.919	12.237	0.174
ELU	11.874	11.827	11.714	12.010	0.150
Sigmoid1	9.730	8.792	9.216	9.576	0.151
Sigmoid2	9.105	8.194	8.664	9.576	0.147
Sigmoid3	9.736	8.792	9.639	10.189	0.149

Method	F2F	M2M	F2M	M2F
AGAIN-VC	9.706	8.786	9.330	10.223
RSAE(SC-L)	10.317	9.019	9.426	10.569
RSAE(SS-L)	9.588	8.557	9.190	10.069
RSAE(2Enc)	10.250	9.174	10.163	10.826
RSAE-VC	9.486	8.477	9.011	9.894

Method	F2F/Hz	M2M/Hz	F2M/Hz	M2F/Hz
AGAIN-VC	97.371	69.956	72.332	96.497
RSAE(SC-L)	97.889	72.759	69.878	97.418
RSAE(SS-L)	99.337	71.319	68.936	97.810
RSAE(2Enc)	98.101	72.779	69.377	96.677
RSAE-VC	94.483	67.662	69.358	96.579
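
The percentage improvements quoted in the abstract can be checked against the AGAIN-VC and RSAE-VC rows of the two method-comparison tables above. The sketch below assumes that the first of those tables reports Mel cepstral distance and the second reports pitch-frequency RMSE in Hz, matching the metrics named in the abstract. Averaging the four per-direction relative reductions gives about 3.11% for the first table, in line with the abstract. For the second table the same averaging gives about 2.57%, while pooling the four directions before taking the ratio gives about 2.40%, closer to the quoted 2.41%. The source does not state which averaging was used, so this is a sanity check rather than a reconstruction of the authors' calculation; the helper names are hypothetical.

```python
def relative_reduction(baseline, proposed):
    """Per-direction relative reduction in percent (positive = improvement)."""
    return [(b - p) / b * 100.0 for b, p in zip(baseline, proposed)]


def mean(values):
    return sum(values) / len(values)


# Rows copied from the tables, in the order F2F, M2M, F2M, M2F.
mcd_again = [9.706, 8.786, 9.330, 10.223]
mcd_rsae = [9.486, 8.477, 9.011, 9.894]
f0_again = [97.371, 69.956, 72.332, 96.497]
f0_rsae = [94.483, 67.662, 69.358, 96.579]

# Assumption 1: average of the per-direction relative reductions.
mcd_mean_reduction = mean(relative_reduction(mcd_again, mcd_rsae))
f0_mean_reduction = mean(relative_reduction(f0_again, f0_rsae))

# Assumption 2: pool the four directions, then take one ratio.
f0_pooled_reduction = (sum(f0_again) - sum(f0_rsae)) / sum(f0_again) * 100.0

print(f"MCD, mean of ratios: {mcd_mean_reduction:.2f}%")
print(f"F0 RMSE, mean of ratios: {f0_mean_reduction:.2f}%")
print(f"F0 RMSE, pooled: {f0_pooled_reduction:.2f}%")
```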