Unit Selection Algorithm Using Bi-grams Model For Corpus-Based Speech Synthesis

Authors: Mohamed Ali KAMMOUN, Ahmed Ben HAMIDA


In this paper, we present a novel statistical approach to corpus-based speech synthesis. Classically, phonetic information is treated as the acoustic reference to be respected, and many studies have accordingly addressed the classification of acoustic units. Such classification separates units according to their symbolic characteristics, and unit selection has classically relied on a target cost and a concatenation cost. In corpus-based speech synthesis systems that use large text corpora, the cost functions have been limited to a juxtaposition of symbolic criteria, and the acoustic information of the units is not exploited in the definition of the target cost. In this paper, we take into account the phonetic information of each unit in correspondence with its acoustic information. We do so by defining a probabilistic linguistic bigram model that serves as the basis for unit selection. The selected units are extracted from the English TIMIT corpus.

Keywords: Unit selection, Corpus-based Speech Synthesis, Bigram model
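
As an illustration of how a bigram model can drive unit selection, the following is a minimal Python sketch and not the authors' implementation. It assumes that each candidate unit carries a symbolic label, that bigram probabilities are estimated from label sequences with add-one smoothing, and that the best unit sequence is found by a Viterbi-style search maximizing the summed log-probabilities. The function names (train_bigrams, bigram_logprob, select_units) and the toy data are hypothetical.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical start-of-sequence marker


def train_bigrams(sequences):
    """Count context labels and label bigrams over training sequences."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        padded = [START] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1          # how often prev occurs as a context
            bigrams[(prev, cur)] += 1    # how often cur follows prev
    return unigrams, bigrams, vocab


def bigram_logprob(prev, cur, unigrams, bigrams, vocab_size):
    """log P(cur | prev) with add-one smoothing (an assumed choice)."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))


def select_units(candidates, unigrams, bigrams, vocab_size):
    """Viterbi-style search; candidates[i] lists the labels allowed at position i."""
    best = {START: (0.0, [])}  # label -> (best log-prob so far, path to it)
    for options in candidates:
        new_best = {}
        for cur in options:
            score, path = max(
                ((s + bigram_logprob(prev, cur, unigrams, bigrams, vocab_size), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy data only; a real system would use labels derived from a speech corpus.
training = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
unigrams, bigrams, vocab = train_bigrams(training)
chosen = select_units([["a", "c"], ["b", "d"], ["c", "d"]],
                      unigrams, bigrams, len(vocab))
print(chosen)  # ['a', 'b', 'c']
```

In this sketch the negative log-probability of a bigram plays the role of a transition cost between consecutive units; any acoustic target or concatenation cost would have to be added to the path score separately.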


