A method of synthetic speech spoofing detection using constant Q modulation envelope

doi: 10.11959/j.issn.1000-0801.2023187

Telecommunications Science, 2023, Vol. 39, Issue (11): 107-115

• Research and Development •

Jia XU¹, Zhihua JIAN¹,², Honghui JIN¹, Chao WU¹

¹ School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China
² Key Laboratory of Data Storage and Transmission Technology of Zhejiang Province, Hangzhou 310018, China

Revised: 2023-09-30  Online: 2023-11-01  Published: 2023-11-01
About the authors: Jia XU (1998- ), female, master's student at the School of Communication Engineering, Hangzhou Dianzi University; her main research interest is speech spoofing detection.
Zhihua JIAN (1978- ), male, Ph.D., associate professor and master's supervisor at the School of Communication Engineering, Hangzhou Dianzi University, and faculty member of the Key Laboratory of Data Storage and Transmission Technology of Zhejiang Province; his main research interests include voice conversion, spoofed speech detection, and voiceprint recognition.
Honghui JIN (1999- ), male, master's student at the School of Communication Engineering, Hangzhou Dianzi University; his main research interests are voice conversion and spoofing detection.
Chao WU (1988- ), male, Ph.D., lecturer and master's supervisor at the School of Communication Engineering, Hangzhou Dianzi University; his main research interests are navigation signal processing and spoofing interference detection.
Supported by: The National Natural Science Foundation of China (61201301, 61772166, 61901154)

Abstract:

In response to the low accuracy of synthetic speech spoofing detection based on traditional acoustic feature parameters, their poor detection performance on unknown types of synthetic speech, and their performance degradation in noisy environments, a method for detecting spoofed synthetic speech using the constant Q modulation envelope (CQME) was proposed. The method was motivated by the fact that the temporal envelope of speech contains abundant information and that the envelopes of synthetic and genuine speech differ considerably in fine detail. The modulation envelope spectrum of speech was obtained with the constant Q transform (CQT), and the root mean square (RMS) of each frequency component was calculated to derive the CQME feature vector. The CQME feature vectors were then used to train a random forest classifier that discriminates genuine speech from spoofed synthetic speech. Experimental results demonstrate that the random forest trained with CQME features achieves high detection performance on the ASVspoof 2019 dataset and also detects unknown types of synthetic speech well. Furthermore, the proposed method maintains high detection performance under various noise conditions, showing very good noise robustness.

Key words: synthetic speech, spoofing speech detection, constant Q modulation envelope, random forest
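The abstract describes the CQME pipeline only in outline: a constant Q transform yields the modulation envelope spectrum, and the RMS of each frequency component forms the feature vector. The pure-Python sketch below shows one plausible reading of that description; it is not the authors' implementation. It assumes a naive CQT in the style of Brown's classic formulation (bin k uses a Hamming-windowed kernel of length N_k = ceil(Q·fs/f_k), with Q = 1/(2^(1/B) - 1) for B bins per octave), treats the magnitude of each bin across frames as that band's temporal envelope, and takes the RMS of that envelope over frames as one feature per bin. Other readings are possible (for example, extracting the envelope first and then analysing it with the CQT). The function names cqt_frame and cqme_features and all parameter values (fmin, bins_per_octave, n_bins, hop) are hypothetical.

```python
import cmath
import math


def cqt_frame(x, center, freqs, fs, q):
    """Naive constant Q transform of one frame, in the style of Brown's formulation.

    Bin k uses a Hamming-windowed complex kernel of length N_k = ceil(q * fs / f_k),
    centred on sample `center`, so low bins get long windows and high bins short ones.
    """
    out = []
    for f_k in freqs:
        n_k = math.ceil(q * fs / f_k)
        start = center - n_k // 2
        acc = 0j
        for n in range(n_k):
            w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_k - 1))
            acc += w * x[start + n] * cmath.exp(-2j * math.pi * q * n / n_k)
        out.append(acc / n_k)
    return out


def cqme_features(x, fs, fmin=50.0, bins_per_octave=12, n_bins=84, hop=160):
    """Hypothetical CQME vector: RMS over frames of each CQT bin's magnitude envelope."""
    q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)  # constant quality factor
    freqs = [fmin * 2 ** (k / bins_per_octave) for k in range(n_bins)]
    if freqs[-1] >= fs / 2:
        raise ValueError("highest CQT bin must lie below the Nyquist frequency")
    half = math.ceil(q * fs / freqs[0]) // 2 + 1  # room for the longest window
    sum_sq = [0.0] * n_bins
    n_frames = 0
    for center in range(half, len(x) - half, hop):
        # |X_k(t)| over successive frames is treated as band k's temporal envelope.
        for k, value in enumerate(cqt_frame(x, center, freqs, fs, q)):
            sum_sq[k] += abs(value) ** 2
        n_frames += 1
    if n_frames == 0:
        raise ValueError("signal is shorter than the longest CQT window")
    return [math.sqrt(s / n_frames) for s in sum_sq]
```

With the assumed defaults, cqme_features(x, 16000) maps a 16 kHz utterance (a list of floats at least about 0.35 s long) to an 84-dimensional vector. The per-bin loop is written for readability; a practical system would use an FFT-based CQT implementation.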

CLC number: TP391.42

Jia XU, Zhihua JIAN, Honghui JIN, Chao WU. A method of synthetic speech spoofing detection using constant Q modulation envelope[J]. Telecommunications Science, 2023, 39(11): 107-115.

Figures/Tables (8): Fig. 1, Fig. 2, Fig. 3, Table 1, Table 2, Table 3, Fig. 4, Table 4


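Table 1  Speakers and utterances per subset of the ASVspoof 2019 dataset
Subset	Male speakers	Female speakers	Genuine utterances	Spoofed utterances	Synthetic utterances (subset of spoofed)
Training set	8	12	2,580	22,800	15,200
Development set	4	6	2,548	22,296	14,864
Evaluation set	21	27	7,355	63,882	49,140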

Table 2  SVM and random forest results for different values of K
K	SVM	Random forest
8	0.7031	0.3626
16	0.2026	0.1366
32	0.1133	0.0624
64	0.1226	0.0780
128	0.1232	0.0817
256	0.1079	0.0962

Table 3  Comparison of features on the development and evaluation sets
Feature	Development set (SVM)	Development set (RF)	Evaluation set (SVM)	Evaluation set (RF)
MFCC	0.3800	0.3571	0.7438	0.7313
LPCC	0.3494	0.2116	0.6311	0.4954
LFCC	0.1246	0.1228	0.1873	0.1367
CQCC	0.1590	0.1300	0.1729	0.1720
MS	0.1648	0.0845	0.1893	0.0985
CQME	0.0982	0.0601	0.1133	0.0624
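The abstract states that the CQME vectors are used to train a random forest, and the result tables compare it with an SVM. The forest settings are not given here, so the sketch below is a generic Breiman-style random forest under assumed settings: a bootstrap resample per tree, Gini-impurity splits over a random subset of about sqrt(d) features at each node, a depth limit, and majority voting. The class RandomForest, its parameters (n_trees, max_depth, seed) and the label convention (for example 0 for genuine and 1 for spoofed) are hypothetical; in practice a library implementation such as scikit-learn's RandomForestClassifier would normally be used instead.

```python
import math
import random
from collections import Counter


class _Node:
    """Internal node (feature, threshold, children) or leaf (feature is None)."""

    def __init__(self, label=None, feature=None, threshold=None, left=None, right=None):
        self.label, self.feature, self.threshold = label, feature, threshold
        self.left, self.right = left, right


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, features):
    """Return the (feature, threshold) pair with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in features:
        # Every distinct value except the largest keeps both children non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(X, y, depth, max_depth, n_feats, rng):
    majority = Counter(y).most_common(1)[0][0]
    if depth >= max_depth or len(set(y)) == 1:
        return _Node(label=majority)
    # A fresh random feature subset at every node decorrelates the trees.
    split = best_split(X, y, rng.sample(range(len(X[0])), n_feats))
    if split is None:
        return _Node(label=majority)
    f, t = split
    go_left = [row[f] <= t for row in X]
    children = []
    for side in (True, False):
        xs = [row for row, g in zip(X, go_left) if g == side]
        ys = [lab for lab, g in zip(y, go_left) if g == side]
        children.append(build_tree(xs, ys, depth + 1, max_depth, n_feats, rng))
    return _Node(feature=f, threshold=t, left=children[0], right=children[1])


class RandomForest:
    """Bagged Gini trees with per-node random feature subsets and majority voting."""

    def __init__(self, n_trees=25, max_depth=6, seed=0):
        self.n_trees, self.max_depth, self.seed = n_trees, max_depth, seed
        self.trees = []

    def fit(self, X, y):
        rng = random.Random(self.seed)
        n_feats = max(1, int(math.sqrt(len(X[0]))))
        self.trees = []
        for _ in range(self.n_trees):
            idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
            self.trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                         0, self.max_depth, n_feats, rng))
        return self

    def predict(self, X):
        """Majority vote over all trees for each row of X."""
        predictions = []
        for row in X:
            labels = []
            for tree in self.trees:
                node = tree
                while node.feature is not None:  # descend until a leaf
                    node = node.left if row[node.feature] <= node.threshold else node.right
                labels.append(node.label)
            predictions.append(Counter(labels).most_common(1)[0][0])
        return predictions
```

Each row of X would be one CQME vector; fitting on labelled training vectors and calling predict on evaluation vectors returns one vote-based label per utterance.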

Table 4  Noise robustness of CQME
Classifier	Clean	Babble noise	Volvo noise	Pink noise
SVM	0.1133	0.1574	0.1308	0.1532
RF	0.0624	0.0950	0.0725	0.0895
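The noise-robustness table reports results with babble, volvo, and pink noise, but neither the signal-to-noise ratios nor the mixing procedure is stated here. A common way to build such test conditions is to scale a noise recording so that the mixture reaches a target SNR and then add it to the clean utterance; the sketch below assumes exactly that. The functions mix_at_snr and measured_snr_db and any SNR value passed to them are hypothetical, and real babble, volvo, or pink noise recordings would have to be supplied by the user.

```python
import math


def mean_power(x):
    """Mean power (mean of squared samples) of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so that 10*log10(P_clean / P_scaled_noise) == snr_db.

    Assumes plain additive mixing over the whole utterance; the noise is truncated
    to the length of the clean signal.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    gain = math.sqrt(mean_power(clean) / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR of a mixture, treating noisy - clean as the added noise."""
    residual = [y - s for s, y in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(residual))
```

The same clean evaluation utterances would be mixed once per noise type, and the CQME features extracted from the noisy versions would then be scored with the already-trained classifier.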
