Telecommunications Science ›› 2023, Vol. 39 ›› Issue (6): 85-95. doi: 10.11959/j.issn.1000-0801.2023121

• Research and Development •

• About the authors:
    JIN Honghui (1999- ), male, M.S. candidate at the School of Communication Engineering, Hangzhou Dianzi University; his main research interest is spoofed speech detection
    JIAN Zhihua (1978- ), male, associate professor and M.S. supervisor at the School of Communication Engineering, Hangzhou Dianzi University, and a faculty member of the Key Laboratory of Data Storage and Transmission Technology of Zhejiang Province; his main research interests include voice conversion, spoofed speech detection, and privacy protection in speech
    YANG Man (2000- ), female, M.S. candidate at the School of Communication Engineering, Hangzhou Dianzi University; her main research interest is spoofed speech detection
    WU Chao (1988- ), male, lecturer at the School of Communication Engineering, Hangzhou Dianzi University; his main research interests include navigation signal processing and spoofing-interference detection

Synthetic speech detection method using texture feature based on circumferential local ternary pattern

Honghui JIN1, Zhihua JIAN1,2, Man YANG1, Chao WU1   

1. School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China
    2. Key Laboratory of Data Storage and Transmission Technology of Zhejiang Province, Hangzhou 310018, China
  • Revised: 2023-06-05 Online: 2023-06-20 Published: 2023-06-01
  • Supported by:
    The National Natural Science Foundation of China (61201301, 61772166, 61901154)


Abstract:

In order to further improve the accuracy of synthetic speech detection, a synthetic speech detection method using texture features based on the circumferential local ternary pattern (CLTP) was proposed. The method extracts texture information from the speech spectrogram using the CLTP and uses it as the feature representation of the speech. A deep residual network is employed as the back-end classifier to decide whether an utterance is genuine or spoofed. Experimental results on the ASVspoof 2019 dataset demonstrate that the proposed method reduces the equal error rate (EER) by 54.29% and 2.15% compared with the traditional constant Q cepstral coefficient (CQCC) and linear predictive cepstral coefficient (LPCC) features respectively, and by 17.14% compared with the local ternary pattern (LTP) texture feature. Because the CLTP jointly considers the differences between the central pixel and its peripheral pixels as well as those among the peripheral pixels themselves within a neighborhood, it captures more complete texture information from the speech spectrogram and thereby improves the accuracy of synthetic speech detection.
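To make the feature-extraction idea concrete, the following is a minimal sketch, not the authors' implementation, of a local-ternary-pattern style encoder over a 2-D spectrogram. The threshold `t`, the clockwise neighbour ordering, and the `ring_codes` extension that adds the circumferential (neighbour-to-neighbour) comparisons described in the abstract are illustrative assumptions.

```python
import numpy as np

def ternary_code(diff, t):
    # Standard LTP quantization: map a difference to {-1, 0, +1} using threshold t.
    return np.where(diff >= t, 1, np.where(diff <= -t, -1, 0))

def cltp_like_codes(spec, t=0.1):
    """Compute ternary texture codes for every interior pixel of a spectrogram.

    center_codes compares each of the 8 neighbours with the centre pixel
    (classic LTP); ring_codes compares each neighbour with the next one
    around the circle, a rough stand-in for the circumferential comparisons
    the CLTP adds. Both have shape (H-2, W-2, 8).
    """
    H, W = spec.shape
    # Offsets of the 8 neighbours, ordered clockwise around the centre.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = spec[1:-1, 1:-1]
    # Shifted views of the spectrogram, one per neighbour direction.
    neigh = [spec[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx] for dy, dx in offs]
    center_codes = np.stack(
        [ternary_code(neigh[k] - centre, t) for k in range(8)], axis=-1)
    ring_codes = np.stack(
        [ternary_code(neigh[(k + 1) % 8] - neigh[k], t) for k in range(8)], axis=-1)
    return center_codes, ring_codes
```

In a pipeline of the kind the abstract describes, histograms of such codes (or the code maps themselves) would then be passed to the deep residual network for genuine/spoofed classification.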

Key words: speaker verification, synthetic speech detection, CLTP, deep residual network

CLC number: 
