Journal on Communications ›› 2024, Vol. 45 ›› Issue (2): 162-172. doi: 10.11959/j.issn.1000-436x.2024044

• Papers •

采用表示分离自编码器的任意说话人语音转换

简志华, 章子旭   

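  1. School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, Zhejiang, China
  • Revised: 2024-01-17 Online: 2024-02-01 Published: 2024-02-01
  • About the authors: Zhihua JIAN (1978− ), male, born in Xinyu, Jiangxi, Ph.D., associate professor at Hangzhou Dianzi University. His main research interests include intelligent speech processing, voice conversion, spoofed speech detection, and speech privacy protection.
    Zixu ZHANG (1999− ), male, born in Hangzhou, Zhejiang, master's student at Hangzhou Dianzi University. His main research interest is voice conversion.
  • Supported by:
    The National Natural Science Foundation of China (61201301); The National Natural Science Foundation of China (61772166)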

Any-to-any voice conversion using representation separation auto-encoder

Zhihua JIAN, Zixu ZHANG   

  1. School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China
  • Revised: 2024-01-17 Online: 2024-02-01 Published: 2024-02-01
  • Supported by:
    The National Natural Science Foundation of China (61201301); The National Natural Science Foundation of China (61772166)

Abstract:

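To address the problem that linguistic content and speaker characteristics are difficult to separate in any-to-any voice conversion with non-parallel corpora, which degrades conversion performance, a voice conversion method using a representation separation auto-encoder, RSAE-VC, is proposed. The method treats the speaker characteristics of a speech signal as time-invariant and the content information as time-varying, separates the two with instance normalization and an activation guidance layer in the encoder, and then has the decoder combine the content information of the source speech with the speaker characteristics of the target speech to generate the converted speech. Experimental results show that, compared with the existing AGAIN-VC conversion method, RSAE-VC reduces the Mel cepstral distance by 3.11% on average and the root mean square error of the fundamental frequency by 2.41%, and improves the MOS and ABX scores by 5.22% and 8.45%, respectively. RSAE-VC uses a self-content loss as a constraint so that the speech better preserves content information, and a self-speaker loss to better separate speaker characteristics from the speech, ensuring that as little speaker information as possible is left in the content information and thereby improving conversion performance.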

Keywords: voice conversion, representation separation, adaptive instance normalization, self-content loss, self-speaker loss
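
The separation idea in the abstract (speaker characteristics as time-invariant statistics, content as the time-varying remainder) can be illustrated with a minimal NumPy sketch of instance normalization and adaptive instance normalization. This is not the RSAE-VC implementation: the function names `instance_norm` and `adain`, the `(channels, frames)` feature layout, and the random test features are all assumptions made for illustration, and the activation guidance layer and both losses are omitted.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel over time. x has shape (channels, frames).

    Returns the normalized features together with the per-channel mean and
    standard deviation, which serve here as a stand-in for the time-invariant
    speaker statistics.
    """
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mean) / std, mean, std

def adain(content, mean, std):
    """Adaptive instance normalization: strip the statistics of `content`,
    then re-apply the given target mean and standard deviation."""
    normalized, _, _ = instance_norm(content)
    return normalized * std + mean

# Hypothetical features: 4 channels, different lengths for source and target.
rng = np.random.default_rng(0)
source = rng.normal(loc=2.0, scale=3.0, size=(4, 50))
target = rng.normal(loc=-1.0, scale=0.5, size=(4, 80))

content, _, _ = instance_norm(source)          # time-varying part of the source
_, tgt_mean, tgt_std = instance_norm(target)   # time-invariant part of the target
converted = adain(source, tgt_mean, tgt_std)   # source content, target statistics
```

In this sketch `converted` keeps the frame count and temporal pattern of `source` while its per-channel mean and standard deviation match those of `target`, which is the sense in which the statistics carry the speaker identity.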

Abstract:

To address the difficulty of separating speaker characteristics from linguistic content in any-to-any voice conversion with non-parallel corpora, which leads to unsatisfactory conversion performance, a voice conversion method called RSAE-VC (representation separation auto-encoder voice conversion) was proposed. The speaker characteristics of a speech signal were regarded as time-invariant and the content information as time-varying, and instance normalization and an activation guidance layer were used in the encoder to separate the two. The decoder then synthesized the converted speech from the content information of the source speech and the speaker characteristics of the target speech. The experimental results demonstrate that, compared with the AGAIN-VC (activation guidance and adaptive instance normalization voice conversion) method, RSAE-VC reduces the Mel cepstral distance by 3.11% and the root mean square error of pitch frequency by 2.41% on average, and increases the MOS and ABX scores by 5.22% and 8.45%, respectively. In RSAE-VC, a self-content loss constrains the converted speech to preserve more content information, and a self-speaker loss separates the speaker characteristics from the speech more thoroughly, so that as little speaker information as possible remains in the content information and the conversion performance improves.

Key words: voice conversion, representation separation, adaptive instance normalization, self-content loss, self-speaker loss
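
For readers unfamiliar with the Mel cepstral distance metric cited in the abstract, the following sketch shows one commonly used definition, averaged over time-aligned frames, plus a helper for a relative reduction in percent. The abstract does not state the exact formula, the alignment procedure, or how the 3.11% figure was computed, so the function names, the `(frames, dims)` layout, and the relative-reduction reading are assumptions.

```python
import numpy as np

def mel_cepstral_distance(ref, conv):
    """Mean Mel cepstral distance in dB between two time-aligned sequences.

    ref, conv: arrays of shape (frames, dims) holding mel-cepstral
    coefficients, assumed to exclude the 0th (energy) coefficient.
    """
    diff = ref - conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

# Toy check: every frame differs by 1.0 in a single coefficient.
ref = np.zeros((3, 2))
conv = ref.copy()
conv[:, 0] = 1.0
mcd_same = mel_cepstral_distance(ref, ref)
mcd_diff = mel_cepstral_distance(ref, conv)
```

A lower value means the converted cepstra are closer to the reference, so a reduction relative to a baseline system indicates an improvement.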

