Telecommunications Science, 2019, Vol. 35, Issue (12): 79-89. doi: 10.11959/j.issn.1000-0801.2019290

• Research and Development •

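End-to-end audiovisual speech recognition based on attention fusion of SDBN and BLSTM

Yiming WANG, Ken CHEN, Aihaiti ABUDUSALAMU

  1. Institute of Communication Technology, Ningbo University, Ningbo 315211, China
  • Revised: 2019-12-10 Online: 2019-12-20 Published: 2020-01-15
  • About the authors: Yiming WANG (1993- ), male, is a master's student at the College of Information Science and Engineering, Ningbo University. His research interests include audio and video information processing and audiovisual speech recognition. | Ken CHEN (1962- ), male, is an associate professor and master's supervisor at the College of Information Science and Engineering, Ningbo University. He has published more than 100 papers in core journals and major international conferences, has participated in or led 16 research projects at the national, provincial and ministerial, municipal, and university levels, and has received 3 research awards. His research interests include image and video information processing, multimedia communication, and intelligent control. | Aihaiti ABUDUSALAMU (1995- ), male, is a master's student at the College of Information Science and Engineering, Ningbo University. His research interests include machine translation and intelligent speech translation.
  • Supported by:
    The National Natural Science Foundation of China (60972063); The Natural Science Foundation of Ningbo of China (2014A610065); Scientific Research Foundation of Ningbo University (XKXL1308)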

Abstract:

An end-to-end audiovisual speech recognition algorithm is proposed. In this algorithm, a sparse DBN (SDBN) is constructed by introducing a mixed l<sub>1/2</sub> norm and l<sub>1</sub> norm into a deep belief network (DBN) with a bottleneck structure. The SDBN extracts sparse bottleneck features and thereby reduces the dimensionality of the data features. A bidirectional long short-term memory network (BLSTM) then processes the features of each modality along the time dimension. Next, an attention mechanism automatically aligns and fuses the processed lip visual information and audio information. Finally, the fused audiovisual information is classified and recognized by a BLSTM with an attached Softmax layer. Experiments show that the algorithm recognizes audiovisual information effectively and achieves a good recognition rate and robustness compared with similar algorithms.

Key words: end-to-end, audiovisual speech recognition, sparse bottleneck features, attention mechanism
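
The abstract names two less standard steps: the mixed l<sub>1/2</sub> and l<sub>1</sub> sparsity term, and the attention-based alignment and fusion of the two modalities. The following is a minimal sketch of both ideas and is not the authors' implementation. It makes several assumptions that the abstract does not confirm: the penalty is a weighted sum applied to hidden activations, the attention score is a plain dot product, and fusion is done by concatenation. The names `mixed_sparsity_penalty`, `attention_fuse`, `lam_half`, and `lam_one` are hypothetical.

```python
import math


def mixed_sparsity_penalty(activations, lam_half=0.1, lam_one=0.1):
    """Weighted sum of an l_1/2-style term and an l_1 term.

    Assumption: the penalty acts on hidden activations; the weights
    lam_half and lam_one are hypothetical hyperparameters.
    """
    half_term = sum(math.sqrt(abs(a)) for a in activations)
    one_term = sum(abs(a) for a in activations)
    return lam_half * half_term + lam_one * one_term


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(audio_seq, visual_seq):
    """For each audio step, attend over all visual steps and concatenate.

    Assumption: dot-product scoring, so audio and visual feature vectors
    must have the same dimension. The two sequences may differ in length,
    which is how the attention weights provide a soft alignment.
    """
    dim = len(visual_seq[0])
    fused = []
    for a in audio_seq:
        # Similarity of this audio step to every visual step.
        scores = [sum(x * y for x, y in zip(a, v)) for v in visual_seq]
        weights = softmax(scores)
        # Weighted average of visual features: the aligned visual context.
        context = [
            sum(w * v[d] for w, v in zip(weights, visual_seq))
            for d in range(dim)
        ]
        fused.append(list(a) + context)
    return fused


# Toy usage: 3 audio steps and 2 visual steps, each with 2 features.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]
fused = attention_fuse(audio, visual)  # 3 steps, 4 features per step
```

In a full pipeline of the kind the abstract describes, the inputs to `attention_fuse` would be the per-modality BLSTM outputs, and the fused sequence would feed the final BLSTM with a Softmax layer. Both of those stages are omitted here.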
