A Two-Stage Adaptation towards Automatic Speech Recognition System for Malay-Speaking Children
Authors: Mumtaz Begum Mustafa, Siti Salwah Salim, Feizal Dani Rahman
Abstract:
Recently, Automatic Speech Recognition (ASR) systems have been used to assist children in language acquisition, as they can detect and interpret the human speech signal. Despite the benefits offered by ASR, there is a lack of ASR systems for Malay-speaking children. One contributing factor is the absence of a continuous speech database for the target users. Although cross-lingual adaptation is a common solution for developing ASR systems for under-resourced languages, it is not viable for children because very few children's speech databases are available to serve as a source model. In this research, we propose a two-stage adaptation for developing an ASR system for Malay-speaking children using a very limited database. The two-stage adaptation comprises cross-lingual adaptation (first stage) and cross-age adaptation (second stage). In the first stage, a well-known speech database that is phonetically rich and balanced is adapted to a medium-sized database of Malay adult speech using supervised MLLR. The second stage uses the acoustic model generated by the first adaptation, with a small-sized database of the target users as the target database. We measured the performance of the proposed technique using word error rate and compared it with a conventional benchmark adaptation. The two-stage adaptation proposed in this research achieves better recognition accuracy than the benchmark adaptation in recognizing children's speech.
Keywords: Automatic speech recognition system, children speech, adaptation, Malay.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1112222
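For readers unfamiliar with the techniques named in the abstract, the following is a brief sketch of the standard formulations assumed here; it is background material, not reproduced from the paper itself. Supervised MLLR adapts the Gaussian mean vectors of the source acoustic model with an affine transform estimated from transcribed adaptation data, and recognition performance is scored by word error rate:

\[
\hat{\mu} = A\mu + b, \qquad
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
\]

Here \(\mu\) and \(\hat{\mu}\) are the original and adapted Gaussian means, \(A\) and \(b\) are the MLLR regression matrix and bias estimated from the adaptation data (adult Malay speech in the first stage, children's speech in the second), and \(S\), \(D\), \(I\), and \(N\) are the numbers of substitutions, deletions, insertions, and reference words, respectively.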