Any-to-any voice conversion using representation separation auto-encoder

doi:10.11959/j.issn.1000-436x.2024044

Abstract

In any-to-any voice conversion with non-parallel corpora, it is difficult to separate speaker personality characteristics from semantic content information, which leads to unsatisfactory performance. To address this problem, a voice conversion method called RSAE-VC (representation separation auto-encoder voice conversion) was proposed. The speaker's personality characteristics in the speech were regarded as time-invariant and the content information as time-variant, and instance normalization and an activation guidance layer were used in the encoder to separate them from each other. The decoder then synthesized the converted speech from the content information of the source speech and the personality characteristics of the target speech. The experimental results demonstrate that, compared with the AGAIN-VC (activation guidance and adaptive instance normalization voice conversion) method, RSAE-VC achieves average reductions of 3.11% and 2.41% in Mel cepstral distance and root mean square error of pitch frequency, respectively, and increases of 5.22% in MOS and 8.45% in ABX. In RSAE-VC, a self-content loss is applied so that the converted speech retains more content information, and a self-speaker loss is used to better separate the speaker personality characteristics from the speech. Together they ensure that as little speaker personality information as possible remains in the content information, which improves the conversion performance.
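The separation idea in the abstract (time-invariant speaker characteristics versus time-variant content) can be illustrated with a minimal NumPy sketch of instance normalization and its adaptive counterpart. This is not the authors' implementation: the function names, feature shapes, and random inputs are hypothetical, and the activation guidance layer and both self-losses are omitted.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of x, shaped (channels, frames), over time.

    Returns the normalized features together with the per-channel mean and
    standard deviation that were removed. Those statistics do not vary over
    time, so they stand in for the speaker-related part in this sketch.
    """
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mean) / std, mean, std

def adaptive_instance_norm(content, mean, std):
    """Re-apply per-channel statistics, e.g. those of a target utterance."""
    return content * std + mean

# Hypothetical feature maps standing in for encoder activations.
rng = np.random.default_rng(0)
source = rng.normal(loc=2.0, scale=3.0, size=(4, 50))
target = rng.normal(loc=-1.0, scale=0.5, size=(4, 80))

content, _, _ = instance_norm(source)           # time-variant part of the source
_, tgt_mean, tgt_std = instance_norm(target)    # time-invariant part of the target
converted = adaptive_instance_norm(content, tgt_mean, tgt_std)
```

In this toy setting, `converted` keeps the frame-by-frame shape of `source` while taking on the per-channel mean and spread of `target`, which mirrors the role the abstract assigns to the decoder.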

Key words: voice conversion, representation separation, adaptive instance normalization, self-content loss, self-speaker loss

CLC Number: TP391.42

Zhihua JIAN, Zixu ZHANG. Any-to-any voice conversion using representation separation auto-encoder[J]. Journal on Communications, 2024, 45(2): 162-172.

Figures/Tables 14

λ₁	F2F	M2M	F2M	M2F
2.0	9.111	8.284	8.831	9.427
2.5	9.119	8.401	8.897	9.420
3.0	9.174	8.421	8.862	9.672
3.5	9.075	8.358	8.641	9.466
4.0	9.318	8.388	8.835	9.760

λ₂	F2F	M2M	F2M	M2F
0.5	9.105	8.194	8.664	9.576
0.6	9.143	8.125	8.517	9.545
0.7	9.142	8.301	8.679	9.590
1.0	9.233	8.437	8.580	9.536
1.5	9.252	8.467	8.835	9.484

Activation function	MCD (F2F)	MCD (M2M)	MCD (F2M)	MCD (M2F)	L_rec
None	11.304	9.120	11.112	11.372	0.152
tanh	11.840	11.064	11.029	11.976	0.203
ReLU	12.163	11.917	11.919	12.237	0.174
ELU	11.874	11.827	11.714	12.010	0.150
Sigmoid1	9.730	8.792	9.216	9.576	0.151
Sigmoid2	9.105	8.194	8.664	9.576	0.147
Sigmoid3	9.736	8.792	9.639	10.189	0.149

Method	F2F	M2M	F2M	M2F
AGAIN-VC	9.706	8.786	9.330	10.223
RSAE(SC-L)	10.317	9.019	9.426	10.569
RSAE(SS-L)	9.588	8.557	9.190	10.069
RSAE(2Enc)	10.250	9.174	10.163	10.826
RSAE-VC	9.486	8.477	9.011	9.894

Method	F2F/Hz	M2M/Hz	F2M/Hz	M2F/Hz
AGAIN-VC	97.371	69.956	72.332	96.497
RSAE(SC-L)	97.889	72.759	69.878	97.418
RSAE(SS-L)	99.337	71.319	68.936	97.810
RSAE(2Enc)	98.101	72.779	69.377	96.677
RSAE-VC	94.483	67.662	69.358	96.579
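
As a rough cross-check of the percentages quoted in the abstract, the sketch below computes the relative reduction of RSAE-VC against AGAIN-VC from the two method tables above. It assumes that the first table reports Mel cepstral distance and the second reports pitch-frequency RMSE in Hz; this is inferred from the abstract and the Hz unit, not stated in the table headers. The averaging convention is also not given here, so two plausible ones are shown, and the function names are illustrative.

```python
# Values copied from the two method tables (columns: F2F, M2M, F2M, M2F).
mcd = {
    "AGAIN-VC": [9.706, 8.786, 9.330, 10.223],
    "RSAE-VC": [9.486, 8.477, 9.011, 9.894],
}
f0_rmse_hz = {
    "AGAIN-VC": [97.371, 69.956, 72.332, 96.497],
    "RSAE-VC": [94.483, 67.662, 69.358, 96.579],
}

def mean_relative_reduction(baseline, proposed):
    """Average of the per-column percentage reductions relative to the baseline."""
    reductions = [100.0 * (b - p) / b for b, p in zip(baseline, proposed)]
    return sum(reductions) / len(reductions)

def pooled_relative_reduction(baseline, proposed):
    """Percentage reduction of the column sum (equivalently, the column average)."""
    return 100.0 * (sum(baseline) - sum(proposed)) / sum(baseline)

for name, table in (("MCD", mcd), ("F0 RMSE", f0_rmse_hz)):
    base, prop = table["AGAIN-VC"], table["RSAE-VC"]
    print(f"{name}: per-column mean {mean_relative_reduction(base, prop):.2f}%, "
          f"pooled {pooled_relative_reduction(base, prop):.2f}%")
```

The two conventions give slightly different numbers, so a small gap from the abstract's figures may reflect the averaging choice or rounding rather than an inconsistency in the tables.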

Method	Time/s
AGAIN-VC	0.177
RSAE(SC-L)	0.218
RSAE(SS-L)	0.255
RSAE(2Enc)	0.304
RSAE-VC	0.256