Research development and forecast of automatic speech recognition technologies

doi:10.11959/j.issn.1000-0801.2018095

Abstract

Automatic speech recognition (ASR) aims to enable machines to “understand” human speech and convert it into readable text. ASR is a key technology for human-machine interaction and has long been an active research area. In recent years, the application of deep neural networks, the use of big data, and the spread of cloud computing have driven great progress in ASR, which has crossed the threshold of practical application in many industries. More and more products with ASR have entered people’s daily lives, such as Apple’s Siri, Amazon’s Alexa, the IFLYTEK speech input method, and the Dingdong intelligent speaker. This paper reviews the development status of ASR and the key technical breakthroughs of recent years, and gives a forecast of how ASR technologies are likely to develop.

Key words: automatic speech recognition, deep neural network, acoustic model, language model

CLC Number: TP393

Haikun WANG, Jia PAN, Cong LIU. Research development and forecast of automatic speech recognition technologies[J]. Telecommunications Science, 2018, 34(2): 1-11.
