Telecommunications Science ›› 2019, Vol. 35 ›› Issue (12): 79-89. doi: 10.11959/j.issn.1000-0801.2019290

• Research and development •

End-to-end audiovisual speech recognition based on attention fusion of SDBN and BLSTM


  1. Institute of Communication Technology, Ningbo University, Ningbo 315211, China
  • Revised: 2019-12-10  Online: 2019-12-20  Published: 2020-01-15
  • Supported by:
    The National Natural Science Foundation of China (60972063); The Natural Science Foundation of Ningbo of China (2014A610065); Scientific Research Foundation of Ningbo University (XKXL1308)


An end-to-end audiovisual speech recognition algorithm was proposed. In the algorithm, a sparse deep belief network (SDBN) was constructed by introducing a mixed l<sub>1/2</sub>-norm and l<sub>1</sub>-norm penalty into a deep belief network (DBN) with a bottleneck structure. The SDBN extracted sparse bottleneck features, which reduced the dimensionality of the data features, and a bidirectional long short-term memory network (BLSTM) then modeled these features as a time series. Next, an attention mechanism automatically aligned and fused the visual information from the lips with the audio information. Finally, the fused audiovisual information was classified by a BLSTM with an attached Softmax layer. Experiments show that the algorithm effectively recognizes both visual and auditory information and achieves a good recognition rate and robustness compared with similar algorithms.
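
The two central ideas in the abstract, a mixed sparsity penalty and attention-based alignment of two streams with different frame rates, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact penalty form or the attention scoring function, so the sketch assumes a penalty of the form λ₁ Σ|h|^(1/2) + λ₂ Σ|h| on hidden activations and scaled dot-product attention in which each audio frame attends over the video frames. The function names, the weights `lam_half` and `lam_one`, and the feature sizes are all hypothetical, and the DBN and BLSTM stages are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def mixed_sparsity_penalty(h, lam_half=0.1, lam_one=0.1):
    """Assumed mixed penalty: lam_half * sum(|h|^(1/2)) + lam_one * sum(|h|).

    The exact form used in the paper is not stated in the abstract;
    this is one common way to combine the two terms.
    """
    a = np.abs(h)
    return lam_half * np.sum(np.sqrt(a)) + lam_one * np.sum(a)


def attention_fuse(audio, video):
    """Align video frames to audio frames and concatenate the result.

    audio: array of shape (T_a, d), one feature vector per audio frame
    video: array of shape (T_v, d), one feature vector per video frame
    Returns (fused, weights), where fused has shape (T_a, 2 * d) and
    weights has shape (T_a, T_v) with each row summing to 1.
    Scaled dot-product scoring is an assumption made for this sketch.
    """
    d = audio.shape[1]
    scores = audio @ video.T / np.sqrt(d)      # (T_a, T_v) similarity
    weights = softmax(scores, axis=1)          # soft alignment per audio frame
    context = weights @ video                  # (T_a, d) aligned video summary
    fused = np.concatenate([audio, context], axis=1)
    return fused, weights


# Hypothetical sizes: 100 audio frames and 25 video frames, 32-dim features.
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((100, 32))
video_feats = rng.standard_normal((25, 32))
fused, weights = attention_fuse(audio_feats, video_feats)
print(fused.shape)  # (100, 64)
```

In a full system along the lines described above, `audio_feats` and `video_feats` would be the outputs of the per-modality feature extractors and sequence models, and `fused` would be passed to the final classifier.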

Key words: end-to-end, audiovisual speech recognition, sparse bottleneck features, attention mechanism

CLC Number: 
