Automatic Distance Compensation for Robust Voice-based Human-Computer Interaction

Randy Gomez; Keisuke Nakamura; Kazuhiro Nakadai

Abstract:

Distant-talking voice-based HCI systems suffer from performance degradation due to mismatch between the acoustic speech (runtime) and the acoustic model (training). The mismatch is caused by changes in the power of the speech signal as observed at the microphones. This change is strongly influenced by the speaker's distance, which affects speech dynamics inside the room before the signal reaches the microphones. Moreover, as the speech signal is reflected, its acoustic characteristics are also altered by the room properties. In general, power mismatch due to distance is a complex problem. This paper presents a novel approach to dealing with distance-induced mismatch by intelligently sensing instantaneous voice power variation and compensating model parameters. First, the distant-talking speech signal is processed through microphone array processing, and the corresponding distance information is extracted. Distance-sensitive Gaussian Mixture Models (GMMs), pre-trained to capture both speech power and room properties, are used to predict the optimal distance of the speech source. Consequently, pre-computed statistic priors corresponding to the optimal distance are selected to correct the statistics of the generic model, which was frozen during training. Thus, model combinatorics are post-conditioned to match the power of instantaneous speech acoustics at runtime. This results in an improved likelihood of predicting the correct speech command at farther distances. We experiment using real data recorded inside two rooms. Experimental evaluation shows that voice recognition performance with our method is more robust to changes in distance than the conventional approach. In our experiment, under the most acoustically challenging environment (i.e., Room 2: 2.5 meters), our method achieved a 24.2% improvement in recognition performance over the best-performing conventional method.
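
The selection-then-compensation idea described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the abstract does not specify the feature type, the form of the pre-computed priors, or how the generic model is corrected. Here the features are toy one-dimensional values, each candidate distance has a small diagonal-covariance GMM, and the "prior" is modelled as an additive offset applied to the generic model means. The names `gmm_avg_loglik`, `select_distance`, `compensate_means`, `distance_gmms`, and `mean_offsets` are hypothetical.

```python
import numpy as np


def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (K,); means and variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, D)
    log_comp = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                        # (T, K)
    log_weighted = log_comp + np.log(weights)
    # Numerically stable log-sum-exp over the K mixture components.
    peak = log_weighted.max(axis=1, keepdims=True)
    loglik = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(loglik.mean())


def select_distance(frames, distance_gmms):
    """Return the candidate distance whose GMM scores the frames highest."""
    scores = {d: gmm_avg_loglik(frames, *gmm) for d, gmm in distance_gmms.items()}
    return max(scores, key=scores.get)


def compensate_means(generic_means, mean_offsets, distance):
    """Assumed correction: shift generic model means by a per-distance offset."""
    return generic_means + mean_offsets[distance]


# Toy setup (all values invented for illustration, not taken from the paper).
unit_var = np.ones((2, 1))
half = np.array([0.5, 0.5])
distance_gmms = {
    0.5: (half, np.array([[-0.5], [0.5]]), unit_var),    # near: higher power
    2.5: (half, np.array([[-6.5], [-5.5]]), unit_var),   # far: lower power
}
mean_offsets = {0.5: np.array([[0.0]]), 2.5: np.array([[-6.0]])}

rng = np.random.default_rng(0)
frames = rng.normal(loc=-6.0, scale=1.0, size=(200, 1))  # simulated far speech

best = select_distance(frames, distance_gmms)
generic_means = np.array([[0.0], [1.0]])
adapted_means = compensate_means(generic_means, mean_offsets, best)
print(best, adapted_means.ravel())
```

In this sketch the distance classifier and the recognizer's model are decoupled: the generic model stays fixed, and only a lookup keyed by the selected distance changes its statistics at runtime, which mirrors the "frozen during training, corrected afterwards" description in the abstract.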

Keywords: Human Machine Interaction, Human Computer Interaction, Voice Recognition, Acoustic Model Compensation, Acoustic Speech Enhancement.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1087165
