High-Individuality Voice Conversion Based on Concatenative Speech Synthesis

Kei Fujii; Jun Okawa; Kaori Suigetsu

Abstract:

Concatenative speech synthesis uses a large speech corpus to produce speech that sounds natural and preserves the individuality of the speaker. Based on this method, this paper proposes a voice conversion method whose converted speech has high individuality and naturalness. We also conducted two subjective evaluation experiments to assess the individuality and sound quality of the converted speech. The results confirm three facts: (a) the proposed method converts speaker individuality well, (b) incorporating the unit selection framework of concatenative speech synthesis (especially the join cost) into conventional voice conversion improves the sound quality of the converted speech, and (c) the proposed method is robust to gender differences between the source speaker and the target speaker.
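
As an illustration of the unit selection idea mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes that each frame and each corpus unit is a plain feature vector in a comparable space, that both the target cost and the join cost are Euclidean distances, and that units adjacent in the corpus join at zero cost. The names `select_units`, `join_weight`, and `euclidean` are hypothetical. A Viterbi-style search picks the unit sequence with the lowest total cost.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_units(frames, units, join_weight=1.0):
    """Return indices into `units`, one per frame, minimizing
    sum(target cost) + join_weight * sum(join cost).

    Assumption (illustrative only): both costs are Euclidean distances.
    """
    if not frames or not units:
        return []
    n = len(units)

    def join_cost(i, j):
        # Units that were consecutive in the corpus concatenate for free.
        if j == i + 1:
            return 0.0
        return euclidean(units[i], units[j])

    # cost[j]: lowest cumulative cost of any path ending in unit j.
    cost = [euclidean(frames[0], u) for u in units]
    backpointers = []
    for frame in frames[1:]:
        new_cost, pointers = [], []
        for j in range(n):
            best_i = min(
                range(n),
                key=lambda i: cost[i] + join_weight * join_cost(i, j),
            )
            step = join_weight * join_cost(best_i, j) + euclidean(frame, units[j])
            new_cost.append(cost[best_i] + step)
            pointers.append(best_i)
        cost = new_cost
        backpointers.append(pointers)

    # Trace the cheapest path back from the last frame to the first.
    j = min(range(n), key=lambda k: cost[k])
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    units = [[0.0], [5.0], [2.0]]
    frames = [[0.0], [3.0]]
    print(select_units(frames, units, join_weight=0.0))
    print(select_units(frames, units, join_weight=1.0))
```

With `join_weight=0.0` the search reduces to a frame-by-frame nearest-unit lookup. Raising the weight favors sequences that concatenate smoothly, which is the role the abstract attributes to the join cost.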

Keywords: concatenative speech synthesis, join cost, speaker individuality, unit selection, voice conversion

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1054889

