Continuous Feature Adaptation for Non-Native Speech Recognition
Authors: Y. Deng, X. Li, C. Kwan, B. Raj, R. Stern
Abstract:
The speech interfaces currently used in many military applications may be adequate for native speakers, but recognition accuracy drops sharply for non-native speakers (people with foreign accents). This is mainly because non-native speakers exhibit large temporal and intra-phoneme variations when pronouncing the same words. The problem is further complicated by strong environmental noise such as tank noise, helicopter noise, etc. In this paper, we propose a novel continuous acoustic feature adaptation algorithm for on-line accent and environmental adaptation. Implemented with an incremental singular value decomposition (SVD), the algorithm captures local acoustic variation and runs in real time. This feature-based adaptation method is then integrated with the conventional model-based maximum likelihood linear regression (MLLR) algorithm. Extensive experiments were performed on the NATO non-native speech corpus with a baseline acoustic model trained on native American English. The proposed feature-based adaptation algorithm improved average recognition accuracy by 15%, while MLLR model-based adaptation achieved an 11% improvement; the corresponding word error rate (WER) reductions were 25.8% and 2.73%, respectively, compared to the system without adaptation. The combined adaptation achieved an overall recognition accuracy improvement of 29.5% and a WER reduction of 31.8% compared to the system without adaptation.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1329829
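To make the general idea concrete, the following minimal Python/NumPy sketch illustrates feature-space adaptation driven by a low-rank SVD of recent frames, cascaded with an MLLR-style affine transform. The class name, feature dimension, window size, rank, and the use of a full thin SVD in place of a true incremental update are illustrative assumptions for this sketch, not the authors' implementation.

import numpy as np

class IncrementalSVDAdapter:
    """Tracks a low-rank subspace of recent cepstral frames and removes the
    dominant local variation (e.g., accent/environment drift) from each frame.
    Hypothetical illustration; parameters are not taken from the paper."""

    def __init__(self, dim=13, rank=3, window=200):
        self.rank = rank
        self.window = window                 # number of recent frames retained
        self.buffer = np.zeros((0, dim))     # rolling buffer of feature frames

    def update(self, frame):
        # Append the new frame and keep only the most recent `window` frames.
        self.buffer = np.vstack([self.buffer, frame])[-self.window:]

    def adapt(self, frame):
        if len(self.buffer) <= self.rank:
            return frame                     # not enough context yet
        mean = self.buffer.mean(axis=0)
        # Thin SVD of the mean-removed buffer; a real-time system would use an
        # incremental rank-one SVD update instead of recomputing from scratch.
        _, _, vt = np.linalg.svd(self.buffer - mean, full_matrices=False)
        basis = vt[:self.rank]               # leading right singular vectors
        centered = frame - mean
        # Remove the component explained by the local low-rank subspace.
        return frame - basis.T @ (basis @ centered)

def mllr_transform(frame, A, b):
    # MLLR-style affine transform y = A x + b; A and b are assumed to be
    # estimated elsewhere from adaptation data.
    return A @ frame + b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    adapter = IncrementalSVDAdapter(dim=13, rank=3, window=200)
    A, b = np.eye(13), np.zeros(13)          # placeholder identity transform
    for _ in range(300):
        x = rng.normal(size=13)              # stand-in for one MFCC frame
        adapter.update(x)
        y = mllr_transform(adapter.adapt(x), A, b)

In a deployed system the per-frame cost would be kept constant by replacing the full SVD recomputation above with an incremental rank-one update of the factorization, as in Brand's incremental SVD (reference [18] below).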
References:
[1] B. R. Ramakrishnan, Reconstruction of Incomplete Spectrograms for Robust Speech Recognition, Ph.D. dissertation, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, 2000.
[2] Z. Wang, T. Schultz, A. Waibel, "Comparison of acoustic model adaptation techniques on non-native speech," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2003.
[3] S. V. Vaseghi and B. P. Milner, "Noise-adaptive hidden Markov models based on Wiener filters," Proc. European Conf. Speech Technology, Berlin, 1993, Vol. II, pp. 1023-1026.
[4] "Acoustical and Environmental Robustness in Automatic Speech Recognition". A. Acero. Ph. D.Dissertation, ECE Department, CMU, Sept. 1990.
[5] A. Nadas, D. Nahamoo, and M. A. Picheny, "Speech recognition using noise-adaptive prototypes," IEEE Trans. Acoust. Speech Signal Process., Vol. 37, No. 10, pp. 1495-1502, 1989.
[6] D. Mansour and B. H. Juang, "The short-time modified coherence representation and its application for noisy speech recognition," Proc. IEEE Int. Conf. Acoust. Speech Signal Process., New York, April 1988.
[7] S. Chakrabartty, Y. Deng and G. Cauwenberghs, "Robust Speech Feature Extraction by Growth Transformation in Reproducing Kernel Hilbert Space," Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing (ICASSP'2004), Montreal, Canada, May 17-21, 2004.
[8] O. Ghitza, "Auditory nerve representation as a basis for speech processing," in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., Marcel Dekker, New York, Chapter 15, pp. 453-485.
[9] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[10] Y. Deng, S. Chakrabartty, and G. Cauwenberghs, "Analog Auditory Perception Model for Robust Speech Recognition," Proc. IEEE Int. Joint Conf. on Neural Networks (IJCNN'2004), Budapest, Hungary, July 2004.
[11] F.H. Liu, R.M. Stern, X. Huang, A. Acero, "Efficient Cepstral Normalization for Robust Speech Recognition", Proceedings of ARPA Speech and Natural Language Workshop, 1993.
[12] S. Wegmann, D. McAllaster, J. Orloff, B. Peskin, "Speaker normalization on conversational telephone speech", Proc. ICASSP, 1996.
[13] C. J. Leggetter, P. C. Woodland, "Speaker adaptation of HMMs using linear regression," Technical Report CUED/F-INFENG/TR.181, Cambridge University, 1994.
[14] D. Giuliani, M. Gerosa, F. Brugnara, "Speaker Normalization through Constrained MLLR Based Transforms", International Conference on Spoken Language Processing, ICSLP, 2004.
[15] C. H. Lee, J. L. Gauvain, "Speaker adaptation based on MAP estimation of HMM parameters," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 1993.
[16] V. Doumpiotis, S. Tsakalidis, and W. Byrne. "Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation", IEEE Transactions on Speech and Audio Processing, 13(3), May 2005.
[17] G. Saon, G. Zweig and M. Padmanabhan, "Linear feature space projections for speaker adaptation", ICASSP 2001, Salt Lake City, Utah, 2001.
[18] M. Brand, "Incremental singular value decomposition of uncertain data with missing values," Proc. European Conference on Computer Vision (ECCV), 2002.
[19] E. F. Deprettere, Ed., SVD and Signal Processing: Algorithms, Analysis and Applications, Elsevier Science Publishers, North-Holland, 1988.
[20] K. Hermus, I. Dologlou, P. Wambacq and D. Van Compernolle, "Fully Adaptive SVD-Based Noise Removal for Robust Speech Recognition," Proc. European Conference on Speech Communication and Technology, Vol. V, pp. 1951-1954, Budapest, Hungary, September 1999.
[21] L. F. Uebel and P. C. Woodland, "Improvements in linear transforms based speaker adaptation," in ICASSP, 2001.
[22] T. Anastasakos, J. McDonough, R. Schwartz, et al., "A compact model for speaker-adaptive training," in Proc. ICSLP, 1996.
[23] P. C. Woodland and D. Povey, "Large scale discriminative training for speech recognition," in Proceedings of the Tutorial and Research Workshop on Automatic Speech Recognition. ISCA, 2000.
[24] L. Benarousse, E. Geoffrois, J. Grieco, R. Series, et al., "The NATO Native and Non-Native (N4) Speech Corpus," in Proc. Workshop on Multilingual Speech and Language Processing, Aalborg, Denmark, 2001.
[25] M. K. Ravishankar, "Sphinx-3 s3.3 Decoder," Sphinx Speech Group, CMU.
[26] P. Beyerlein, X. Aubert, R. Haeb-Umbach, M. Harris, "Large vocabulary continuous speech recognition of Broadcast News - the Philips/RWTH approach," Speech Communication, 2002.
[27] V. R. Gadde, A. Stolcke, D. Vergyri, J. Zheng, K. Sonmez, "Building an ASR System for Noisy Environments: SRI's 2001 SPINE Evaluation System," Proc. ICSLP, 2002.
[28] S. M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 35(3), pp. 400-401, March 1987.
[29] D. Povey, P.C. Woodland, M.J.F. Gales, "Discriminative MAP for acoustic model adaptation", Proc. ICASSP, 2003.
[30] J. Stadermann and G. Rigoll, "Two-stage speaker adaptation of hybrid tied-posterior acoustic models," in ICASSP, 2005.
[31] P. Kenny, G. Boulianne, P. Dumouchel, "Eigenvoice Modeling with Sparse Training Data", IEEE Transactions on Speech and Audio Processing, 2005.
[32] T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, "WSJCAM0: A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition," Proc. ICASSP, 1995.