Transformation of Vocal Characteristics: A Review of Literature

Authors: Dong-Yan Huang, Ee Ping Ong, Susanto Rahardja, Minghui Dong, Haizhou Li


The transformation of vocal characteristics aims to modify a voice so that either the intelligibility of an aphonic voice is increased or the speech of one speaker (the source speaker) is perceived as if another speaker (the target speaker) had uttered it. This paper reviews the current state of the art in voice characteristics transformation. Special emphasis is placed on voice transformation methodology, and issues in improving the intelligibility and naturalness of the transformed speech are discussed. In particular, it is suggested that the modulation theory of speech be used as a basis for research on high-quality voice transformation. This approach allows one to separate the linguistic, expressive, organic and perspective information in speech, based on an analysis of how they are fused when speech is produced. The theory therefore provides the fundamentals not only for manipulating non-linguistic, extra-/paralinguistic and intra-linguistic variables for voice transformation, but also for transposing existing voice transformation methods to emotion-related voice quality transformation and speaking style transformation. From the perspectives of human speech production and perception, popular voice transformation techniques are described and classified according to whether their underlying principles derive from speech production mechanisms, speech perception mechanisms, or both. In addition, the advantages and limitations of voice transformation techniques and the experimental manipulation of vocal cues are discussed through examples from past and present research. Finally, conclusions are drawn and a road map toward more natural voice transformation algorithms is outlined.

Keywords: Voice Transformation, Voice Quality, Emotion, Individuality, Speaking Style, Speech Production, Speech Perception.


