Slovenian Text-to-Speech Synthesis for Speech User Interfaces

Authors: Jerneja Žganec Gros, Aleš Mihelič, Nikola Pavešić, Mario Žganec, Stanislav Gruden

Abstract:

The paper presents the design concept of a unit-selection text-to-speech synthesis system for the Slovenian language. Owing to its modular and upgradable architecture, the system can be used in a variety of speech user interface applications, ranging from carrier-grade server voice portals and desktop user interfaces to specialized embedded devices. Since memory and processing power requirements are important factors for a possible implementation in embedded devices, lexica and speech corpora need to be reduced. We describe a simple and efficient implementation of a greedy subset selection algorithm that extracts a compact subset of high-coverage text sentences. An experiment on a reference text corpus showed that the subset selection algorithm produced a compact sentence subset with little redundancy. The adequacy of the spoken output was evaluated with several subjective tests recommended by the International Telecommunication Union (ITU).
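
The greedy subset selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the coverage units or the scoring function, so the sketch assumes character bigrams as a stand-in for phonetic units and scores each sentence by the number of still-uncovered units it contributes. The names `units` and `greedy_select` are hypothetical.

```python
def units(sentence: str) -> set[str]:
    """Hypothetical unit extractor: character bigrams of the lowercased text.

    A real system would more likely use phonetic units (e.g. diphones)
    obtained from a grapheme-to-phoneme conversion; that is an assumption.
    """
    text = sentence.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def greedy_select(sentences: list[str]) -> list[str]:
    """Repeatedly pick the sentence covering the most still-uncovered units."""
    unit_sets = [units(s) for s in sentences]
    uncovered = set().union(*unit_sets)  # every unit seen in the corpus
    selected = []
    while uncovered:
        # Index of the sentence with the largest gain; ties go to the earliest.
        best = max(range(len(sentences)),
                   key=lambda i: len(unit_sets[i] & uncovered))
        gain = unit_sets[best] & uncovered
        if not gain:  # defensive stop; not expected while units remain
            break
        selected.append(sentences[best])
        uncovered -= gain
    return selected


if __name__ == "__main__":
    corpus = ["abc", "bcd", "abcd", "xy"]
    print(greedy_select(corpus))
```

In this toy corpus, "abcd" already contains every bigram found in "abc" and "bcd", so only "abcd" and "xy" are kept. Each pass rescans the whole corpus, which is simple but quadratic in the worst case; a production variant would likely update the gains incrementally.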

Keywords: text-to-speech synthesis, prosody modeling, speech user interface.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1055797

