Speaker verification method based on cross-domain attentive feature fusion

doi:10.11959/j.issn.1000-436x.2023142

Abstract:

To address the lack of structural information among speech signal samples in the front-end acoustic features of speaker verification systems, a speaker verification method based on cross-domain attentive feature fusion is proposed. First, a feature extraction method based on graph signal processing (GSP) is proposed to capture the structural information of speech signals: each sample point in a speech frame is treated as a graph node to construct a speech graph signal, and the graph-frequency information of the speech signal is extracted through the graph Fourier transform and filter banks. Then, an attentive feature fusion network built from a residual neural network and a squeeze-and-excitation block is proposed to fuse features from the traditional time-frequency domain with those from the graph-frequency domain, improving speaker verification performance. Finally, experiments are carried out on the VoxCeleb, SITW, and CN-Celeb datasets. The results show that the proposed method outperforms the baseline ECAPA-TDNN model in terms of equal error rate (EER) and minimum detection cost function (minDCF).
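
The graph Fourier transform step can be illustrated with a small sketch. The graph topology used in the paper is not stated in this abstract, so the code below assumes, purely for illustration, an undirected path graph in which each sample is linked to its immediate neighbours. For that particular graph the Laplacian eigenvectors have a closed form (the DCT-II basis), which keeps the example free of external libraries. All function names are hypothetical.

```python
import math

def path_graph_laplacian(n):
    """Laplacian L = D - A of an undirected path graph with n nodes (assumed topology)."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        # One edge between consecutive samples i and i + 1.
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap

def gft_basis(n):
    """Orthonormal Laplacian eigenvectors of the path graph (the DCT-II basis)."""
    basis = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        basis.append([scale * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)])
    return basis

def graph_frequencies(n):
    """Eigenvalue (graph frequency) that belongs to each basis vector."""
    return [2.0 - 2.0 * math.cos(math.pi * k / n) for k in range(n)]

def gft(frame):
    """Graph Fourier transform: project the frame onto every eigenvector."""
    basis = gft_basis(len(frame))
    return [sum(u_i * x_i for u_i, x_i in zip(u, frame)) for u in basis]

def igft(coeffs):
    """Inverse transform: weighted sum of the eigenvectors."""
    n = len(coeffs)
    basis = gft_basis(n)
    return [sum(coeffs[k] * basis[k][i] for k in range(n)) for i in range(n)]

# Toy 8-sample "frame"; a real system would use windowed speech samples.
frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
spectrum = gft(frame)
```

A filter bank would then pool the squared magnitudes of `spectrum` over bands of graph frequency; how those bands are laid out is not specified here, so that step is left out.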

Key words: speaker verification, graph signal processing, attentive feature fusion

CLC Number: TN912.34

Zhen YANG, Tianlang WANG, Haiyan GUO, Tingting WANG. Speaker verification method based on cross-domain attentive feature fusion[J]. Journal on Communications, 2023, 44(8): 89-98.

Method	Model	Vox1-O cl. EER	Vox1-E cl. EER	Vox1-H cl. EER
ResNet-based	ResNet34-GAT	1.75%	—	—
	ResNet34	1.46%	1.55%	2.76%
	ResNet34-ft-CBAM	1.08%	1.43%	2.67%
TDNN-based	ECAPA-TDNN	1.05%	1.28%	2.41%
	MFCC + FDLP + wav2vec	2.86%	—	—
Transformer-based	SAEP	2.91%	2.87%	4.75%
	GCSA	1.96%	2.07%	3.65%
MLP-based	MLP-SVNet	1.36%	1.46%	2.49%
Proposed method	ET-AFF-CS128	0.95%	1.12%	2.01%

References

[1]	ATAL B S. Automatic recognition of speakers from their voices[J]. Proceedings of the IEEE, 1976, 64(4): 460-475.
[2]	YANG Z, WANG T T. Research on speech graph signal processing theory and technology[J]. Journal of Nanjing University of Posts and Telecommunications (Natural Science), 2020, 40(5): 43-51. (in Chinese)
[3]	JUNG J W, HEO H S, YU H J, et al. Graph attention networks for speaker verification[C]// Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2021: 6149-6153.
[4]	SHIM H J, HEO J, PARK J H, et al. Graph attentive feature aggregation for text-independent speaker verification[C]// Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2022: 7972-7976.
[5]	LIU B, CHEN Z Y, QIAN Y M. Attentive feature fusion for robust speaker verification[C]// Proceedings of Interspeech 2022. New York: ACM Press, 2022: 286-290.
[6]	SANKALA S, RAFI B S M, K S R M. Multi-feature integration for speaker embedding extraction[C]// Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2022: 7957-7961.
[7]	LIN Y, XU H T, WANG S, et al. Objective assessment of communication speech interference effect based on feature fusion[J]. Journal on Communications, 2023, 44(3): 105-116. (in Chinese)
[8]	ZHENG J Z, JI R Y, ZHANG L B, et al. End-to-end scene text detection and recognition algorithm based on Transformer decoders[J]. Journal on Communications, 2023, 44(5): 64-78. (in Chinese)
[9]	QIN Z J, ZHAO T T, LI F, et al. Survey of research on multimodal semantic communication[J]. Journal on Communications, 2023, 44(5): 28-41. (in Chinese)
[10]	ORTEGA A, FROSSARD P, KOVAČEVIĆ J, et al. Graph signal processing: overview, challenges, and applications[J]. Proceedings of the IEEE, 2018, 106(5): 808-828.
[11]	YAN X, YANG Z, WANG T, et al. An iterative graph spectral subtraction method for speech enhancement[J]. Speech Communication, 2020, 123: 35-42.
[12]	WANG T T, GUO H Y, YAN X, et al. Speech signal processing on graphs: the graph frequency analysis and an improved graph Wiener filtering method[J]. Speech Communication, 2021, 127: 82-91.
[13]	WANG T T, GUO H Y, ZHANG Q Q, et al. A new multilayer graph model for speech signals with graph learning[J]. Digital Signal Processing, 2022, doi: 10.1016/j.dsp.2021.103360.
[14]	WANG T T, PAN Z X, GE M, et al. Time-domain speech separation networks with graph encoding auxiliary[J]. IEEE Signal Processing Letters, 2023, 30: 110-114.
[15]	ZHANG C H, PAN X. Single-channel speech enhancement using graph Fourier transform[C]// Proceedings of Interspeech 2022. New York: ACM Press, 2022: 946-950.
[16]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2016: 770-778.
[17]	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 7132-7141.
[18]	DESPLANQUES B, THIENPONDT J, DEMUYNCK K. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification[C]// Proceedings of Interspeech 2020. New York: ACM Press, 2020: 3830-3834.
[19]	NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: a large-scale speaker identification dataset[J]. arXiv Preprint, arXiv:1706.08612, 2017.
[20]	CHUNG J S, NAGRANI A, ZISSERMAN A. VoxCeleb2: deep speaker recognition[J]. arXiv Preprint, arXiv:1806.05622, 2018.
[21]	MCLAREN M, FERRER L, CASTAN D, et al. The speakers in the wild (SITW) speaker recognition database[C]// Proceedings of Interspeech 2016. New York: ACM Press, 2016: 818-822.
[22]	FAN Y, KANG J W, LI L T, et al. CN-Celeb: a challenging Chinese speaker recognition dataset[C]// Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2020: 7604-7608.
[23]	ZEINALI H, WANG S, SILNOVA A, et al. BUT system description to VoxCeleb speaker recognition challenge 2019[J]. arXiv Preprint, arXiv:1910.12592, 2019.
[24]	SAFARI P, INDIA M, HERNANDO J. Self-attention encoding and pooling for speaker recognition[C]// Proceedings of Interspeech 2020. New York: ACM Press, 2020: 941-945.
[25]	HAN B, CHEN Z Y, QIAN Y M. Local information modeling with self-attention for speaker verification[C]// Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2022: 6727-6731.
[26]	SNYDER D, CHEN G, POVEY D. MUSAN: a music, speech, and noise corpus[J]. arXiv Preprint, arXiv:1510.08484, 2015.
[27]	KO T, PEDDINTI V, POVEY D, et al. A study on data augmentation of reverberant speech for robust speech recognition[C]// Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2017: 5220-5224.
[28]	PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: a simple data augmentation method for automatic speech recognition[C]// Proceedings of Interspeech 2019. New York: ACM Press, 2019: 2613-2617.
[29]	DENG J, GUO J, YANG J, et al. ArcFace: additive angular margin loss for deep face recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 5962-5979.
[30]	YADAV S, RAI A. Frequency and temporal convolutional attention for text-independent speaker recognition[C]// Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2020: 6794-6798.
[31]	HAN B, CHEN Z Y, LIU B, et al. MLP-SVNet: a multi-layer perceptrons based network for speaker verification[C]// Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE Press, 2022: 7522-7526.
[32]	MAATEN L V D, HINTON G. Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9(11): 2579-2605.

Module	Parameters	Output
Conv1	Kernel: 3 × 3 × C, stride: 1 × 1	C × F × T
Conv2	Kernel: 3 × 3 × 1, stride: 1 × 1	1 × F × T
SE	Reduction ratio r = 8	C × F × T
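
As a rough illustration of the SE row, the sketch below follows the generic squeeze-and-excitation recipe (global average pooling, a bottleneck of C / r units with ReLU, a sigmoid gate per channel, then channel-wise rescaling), so the output keeps the C × F × T shape listed in the table. The weights are placeholder values, biases are omitted, and the way the block is wired into the fusion network is an assumption rather than something this page specifies.

```python
import math

def se_block(x, w1, w2):
    """Squeeze-and-excitation over a C x F x T tensor stored as nested lists.

    w1 has shape (C // r) x C and w2 has shape C x (C // r); biases are omitted.
    """
    # Squeeze: global average over frequency and time, one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight each channel; the shape stays C x F x T.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]

C, F, T, r = 8, 2, 3, 8                      # r = 8 as in the table; C, F, T are toy sizes
x = [[[float(c + 1)] * T for _ in range(F)] for c in range(C)]
w1 = [[0.01] * C for _ in range(C // r)]     # placeholder weights, shape 1 x 8
w2 = [[0.1 * (c - 3)] for c in range(C)]     # placeholder weights, shape 8 x 1
y = se_block(x, w1, w2)
```

With r = 8 the bottleneck holds only C / 8 units, which keeps the number of extra parameters small relative to the convolution layers.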

Model	Parameters	Vox1-E cl.		Vox1-H cl.
		EER	minDCF	EER	minDCF
ET-FBank (baseline)	14.73×10⁶	1.279%	0.085	2.411%	0.149
ET-AFF-CS32	14.75×10⁶	1.195%	0.076	2.146%	0.133
ET-AFF-CS64	14.80×10⁶	1.142%	0.073	2.084%	0.131
ET-AFF-CS128	15.03×10⁶	1.121%	0.070	2.010%	0.124
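
For readers unfamiliar with the two metrics in these tables, the following minimal sketch shows one common way to compute them from trial scores. The scores are toy values, and the operating point (`p_target = 0.01`, unit costs) is an assumption; the settings actually used for the reported numbers are not given on this page.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """False rejection and false acceptance rates when accepting score >= threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return frr, far

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep the observed scores and take the point where FRR and FAR are closest."""
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        frr, far = error_rates(target_scores, nontarget_scores, thr)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0)
    return best[1]

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalised minimum detection cost over all thresholds (assumed operating point)."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    best = float("inf")
    for thr in thresholds:
        frr, far = error_rates(target_scores, nontarget_scores, thr)
        cost = c_miss * p_target * frr + c_fa * (1.0 - p_target) * far
        best = min(best, cost)
    # Normalise by the cost of the better trivial system (accept all or reject all).
    return best / min(c_miss * p_target, c_fa * (1.0 - p_target))

# Toy trial scores: higher means "more likely the same speaker".
target_scores = [0.9, 0.8, 0.7, 0.4]
nontarget_scores = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(target_scores, nontarget_scores)
dcf = min_dcf(target_scores, nontarget_scores)
```

Production toolkits usually interpolate between thresholds instead of picking the nearest observed score, so their EER values can differ slightly from this sweep on small trial lists.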

Model	SITW-dev		SITW-eval
	EER	minDCF	EER	minDCF
ET-FBank (baseline)	1.927%	0.128	2.050%	0.133
ET-AFF-CS32	1.617%	0.108	1.804%	0.110
ET-AFF-CS64	1.711%	0.107	1.804%	0.110
ET-AFF-CS128	1.733%	0.098	1.725%	0.108

Model	CN-Celeb1-eval
	EER	minDCF
ET-FBank (baseline)	15.868%	0.568
ET-AFF-CS32	14.559%	0.493
ET-AFF-CS64	14.302%	0.500
ET-AFF-CS128	14.959%	0.493
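
To put the CN-Celeb1-eval numbers on a common scale, the small worked example below computes the relative EER reduction of each fused model against the ET-FBank baseline. The EER values are copied from the table above; the helper name is hypothetical.

```python
def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, as a fraction."""
    return (baseline - value) / baseline

# EER values (%) copied from the CN-Celeb1-eval table.
baseline_eer = 15.868
eers = {"ET-AFF-CS32": 14.559, "ET-AFF-CS64": 14.302, "ET-AFF-CS128": 14.959}

reductions = {name: relative_reduction(baseline_eer, eer) for name, eer in eers.items()}
for name, red in reductions.items():
    print(f"{name}: {100 * red:.2f}% relative EER reduction")
```

On this dataset ET-AFF-CS64 gives the largest relative reduction, a little under 10%, while the larger ET-AFF-CS128 model gains less.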

Model	Parameters	Vox1-E cl.		Vox1-H cl.
		EER	minDCF	EER	minDCF
ET-ADD	14.73×10⁶	1.388%	0.086	2.534%	0.154
ET-CAT	15.12×10⁶	1.308%	0.082	2.357%	0.142
ET-AFF-CS128	15.03×10⁶	1.121%	0.070	2.010%	0.124
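
The three rows appear to compare different ways of combining the two feature streams. The model names are not defined on this page, so the sketch below rests on an assumed reading: ADD as element-wise addition, CAT as concatenation, and AFF as an attention-weighted combination. The gated form shown is a generic convex combination, not the paper's exact fusion network, and the logits stand in for what a learned attention branch would produce.

```python
import math

def fuse_add(a, b):
    """Element-wise addition: no extra parameters, output size unchanged."""
    return [x + y for x, y in zip(a, b)]

def fuse_cat(a, b):
    """Concatenation: output size doubles, so the next layer needs more weights."""
    return list(a) + list(b)

def fuse_gated(a, b, logits):
    """Attention-style fusion: per-element convex combination with w = sigmoid(logit)."""
    out = []
    for x, y, z in zip(a, b, logits):
        w = 1.0 / (1.0 + math.exp(-z))
        out.append(w * x + (1.0 - w) * y)
    return out

# Toy feature vectors standing in for the time-frequency and graph-frequency streams.
fbank_feat = [1.0, 2.0, 3.0]
graph_feat = [3.0, 2.0, 1.0]
logits = [0.0, 5.0, -5.0]        # hypothetical attention logits

added = fuse_add(fbank_feat, graph_feat)
concatenated = fuse_cat(fbank_feat, graph_feat)
gated = fuse_gated(fbank_feat, graph_feat, logits)
```

A logit of zero weights both streams equally, while large positive or negative logits let one stream dominate, which is the flexibility that plain addition and concatenation lack.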
