Automatic Distance Compensation for Robust Voice-based Human-Computer Interaction
Authors: Randy Gomez, Keisuke Nakamura, Kazuhiro Nakadai
Abstract:
Distant-talking voice-based HCI systems suffer from performance degradation due to the mismatch between the acoustic speech (runtime) and the acoustic model (training). The mismatch is caused by the change in the power of the speech signal as observed at the microphones. This change is strongly influenced by the change in distance, which affects speech dynamics inside the room before the signal reaches the microphones. Moreover, as the speech signal is reflected, its acoustical characteristics are also altered by the room properties. In general, power mismatch due to distance is a complex problem. This paper presents a novel approach to dealing with distance-induced mismatch by intelligently sensing instantaneous voice power variation and compensating the model parameters. First, the distant-talking speech signal is processed through microphone array processing, and the corresponding distance information is extracted. Distance-sensitive Gaussian Mixture Models (GMMs), pre-trained to capture both speech power and room properties, are used to predict the optimal distance of the speech source. Consequently, the pre-computed statistic priors corresponding to the optimal distance are selected to correct the statistics of the generic model, which was frozen during training. Thus, model combinatorics are post-conditioned to match the power of the instantaneous speech acoustics at runtime. This results in an improved likelihood of predicting the correct speech command at farther distances. We experiment using real data recorded inside two rooms. Experimental evaluation shows that voice recognition performance using our method is more robust to changes in distance than the conventional approach. In our experiment, under the most acoustically challenging environment (i.e., Room 2 at 2.5 meters), our method achieved a 24.2% improvement in recognition performance over the best-performing conventional method.
Keywords: Human Machine Interaction, Human Computer Interaction, Voice Recognition, Acoustic Model Compensation, Acoustic Speech Enhancement.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1087165
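The selection-then-compensation idea described in the abstract can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs, one per candidate distance, and it assumes that the pre-computed priors act as an additive shift on the generic model means; the abstract does not specify the form of the correction, so this is only a stand-in. All function names, variable names, and toy values below are hypothetical and are not taken from the paper.

```python
import numpy as np

def diag_gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of frames (T, D) under a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                  # (T, K, D)
    log_comp = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)                      # (T, K)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)          # (K,)
    )
    log_comp = log_comp + np.log(weights)
    # Log-sum-exp over components, done stably by subtracting the row max.
    row_max = log_comp.max(axis=1, keepdims=True)
    per_frame = row_max[:, 0] + np.log(np.sum(np.exp(log_comp - row_max), axis=1))
    return float(np.sum(per_frame))

def select_distance(frames, distance_gmms):
    """Return the candidate distance whose GMM scores the frames highest."""
    scores = {d: diag_gmm_loglik(frames, *gmm) for d, gmm in distance_gmms.items()}
    return max(scores, key=scores.get)

def compensate_means(generic_means, mean_offsets, distance):
    """Assumed correction: shift the frozen generic means by a per-distance offset."""
    return generic_means + mean_offsets[distance]

# Toy data: two candidate distances (in meters), single-component GMMs.
frames = np.array([[-2.9, -3.1], [-3.2, -2.8], [-3.0, -3.0]])
distance_gmms = {
    0.5: (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    2.5: (np.array([1.0]), np.array([[-3.0, -3.0]]), np.array([[1.0, 1.0]])),
}
mean_offsets = {0.5: np.zeros(2), 2.5: np.array([-3.0, -3.0])}
generic_means = np.array([[1.0, 2.0], [0.5, -0.5]])

best = select_distance(frames, distance_gmms)
adapted = compensate_means(generic_means, mean_offsets, best)
print(best)
print(adapted)
```

In this toy setup the frames sit close to the mean of the 2.5 m model, so that distance is selected and its offset is applied to the generic means. A real system would score features extracted after microphone array processing and would adapt the statistics of a full acoustic model rather than a small matrix of means.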