Using Teager Energy Cepstrum and HMM Distances in Automatic Speech Recognition and Analysis of Unvoiced Speech

Panikos Heracleous

Abstract:

This study presents the use of a silicon NAM (Non-Audible Murmur) microphone in automatic speech recognition. NAM microphones are special acoustic sensors that are attached behind the talker's ear and can capture not only normal (audible) speech but also very quietly uttered speech (non-audible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise, and they might be used in special systems (speech recognition, speech conversion, etc.) for sound-impaired people. Using a small amount of training data and adaptation approaches, 93.9% word accuracy was achieved for a 20k Japanese vocabulary dictation task. Non-audible murmur recognition in noisy environments is also investigated. NAM speech is further analyzed using distance measures between hidden Markov model (HMM) pairs. A metric distance shows that the spectral space of NAM speech is reduced; however, the locations of the different NAM phonemes are similar to those of the phonemes of normal speech, and the NAM sounds are well discriminated. Promising results using nonlinear features are also introduced, especially under noisy conditions.

Keywords: Speech recognition, unvoiced speech, nonlinear features, HMM distance measures
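
The nonlinear features named in the title build on the Teager energy operator. The abstract does not describe the feature pipeline itself, so the following is only a minimal sketch of the standard discrete Teager-Kaiser operator, psi[n] = x[n]^2 - x[n-1]*x[n+1], and not the author's implementation. The function name `teager_energy` and the test-tone parameters are assumptions made for illustration.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Illustrative helper (hypothetical name, not taken from the paper).
    The operator needs both neighbours of each sample, so the output is
    two samples shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Sanity check on a pure tone (assumed parameters): for
# x[n] = A * cos(omega * n + phi), the operator returns the constant
# A^2 * sin(omega)^2, so it tracks both amplitude and frequency.
amplitude, omega, phase = 0.5, 0.3, 0.1  # omega in radians per sample
n = np.arange(200)
tone = amplitude * np.cos(omega * n + phase)

psi = teager_energy(tone)
expected = amplitude ** 2 * np.sin(omega) ** 2
```

A cepstral feature would additionally require a filterbank, a logarithm, and a discrete cosine transform applied to such energies; those stages are outside this sketch.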

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1055942
