Applications of Support Vector Machines on Smart Phone Systems for Emotional Speech Recognition

Wernhuar Tarng; Yuan-Yuan Chen; Chien-Lung Li; Kun-Rong Hsie; Mingteh Chen

Abstract:

This study proposes an emotional speech recognition system for smart phones that works with 3G mobile communications and social networks to provide users and their groups with more interaction and care. The system uses support vector machines (SVM) to recognize emotions in speech such as happiness, anger, sadness, and normal. A hierarchical classifier adjusts the weights of the acoustic features, and the parameters are divided into energy and frequency categories for training. In total, 28 commonly used acoustic features, including pitch and volume, were used for training. In addition, a time-frequency parameter obtained by the continuous wavelet transform was used to identify the accent and intonation in a sentence during recognition. The Berlin Database of Emotional Speech was divided into male and female data sets for training. In the experiments, adding the time-frequency parameter increased the accuracy of classifying happy and angry emotions by 4.6% on the male test set and 5.2% on the female test set. For the classification of all emotions, the average accuracy over male and female data was 63.5% on the test set and 90.9% on the whole data set.
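The hierarchical SVM idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the tree layout, the kernel, or the feature values, so everything below is an assumption made for illustration: the two-level tree (a high-arousal versus low-arousal split on energy features, then a split within each branch on frequency features), the three-feature layout `(energy, pitch, time_frequency)`, the synthetic cluster centres in `CENTERS`, and all function names. A plain linear SVM trained by sub-gradient descent stands in for whatever SVM implementation the study used.

```python
import itertools
import random

# Assumed feature layout for this sketch: (energy, pitch, time_frequency).
ENERGY = (0,)         # indices of energy-category features
FREQUENCY = (1, 2)    # indices of frequency-category features

# Hypothetical cluster centres for synthetic data; not values from the study.
CENTERS = {
    "happiness": (2.0, 1.0, 2.0),
    "anger": (2.0, -1.0, -2.0),
    "sadness": (-2.0, -2.0, -1.0),
    "normal": (-2.0, 2.0, 1.0),
}


def make_data(offsets=(-0.4, 0.0, 0.4)):
    """Build a small deterministic grid of points around each centre."""
    rows, emotions = [], []
    for emotion, centre in CENTERS.items():
        for delta in itertools.product(offsets, repeat=3):
            rows.append(tuple(c + d for c, d in zip(centre, delta)))
            emotions.append(emotion)
    return rows, emotions


def train_linear_svm(rows, labels, lr=0.1, lam=0.01, epochs=50, seed=0):
    """Fit a linear SVM (hinge loss + L2 penalty) by sub-gradient descent."""
    rng = random.Random(seed)
    w, b = [0.0] * len(rows[0]), 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = rows[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]   # L2 shrinkage
            if margin < 1.0:                          # hinge loss is active
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def score(model, x):
    """Signed decision value of a trained (weights, bias) pair."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def pick(x, group):
    """Select the features that belong to one category."""
    return [x[i] for i in group]


def fit_node(rows, emotions, positive, negative, group):
    """Train one binary node on the rows of two emotion sets."""
    xs, ys = [], []
    for x, e in zip(rows, emotions):
        if e in positive or e in negative:
            xs.append(pick(x, group))
            ys.append(1 if e in positive else -1)
    return train_linear_svm(xs, ys)


def fit_hierarchy(rows, emotions):
    """Train the assumed two-level tree: arousal first, then each branch."""
    return {
        "arousal": fit_node(rows, emotions, {"happiness", "anger"},
                            {"sadness", "normal"}, ENERGY),
        "high": fit_node(rows, emotions, {"happiness"}, {"anger"}, FREQUENCY),
        "low": fit_node(rows, emotions, {"normal"}, {"sadness"}, FREQUENCY),
    }


def predict(models, x):
    """Walk the tree: energy features pick the branch, frequency the leaf."""
    if score(models["arousal"], pick(x, ENERGY)) >= 0:
        branch, labels = "high", ("happiness", "anger")
    else:
        branch, labels = "low", ("normal", "sadness")
    leaf_score = score(models[branch], pick(x, FREQUENCY))
    return labels[0] if leaf_score >= 0 else labels[1]


rows, emotions = make_data()
models = fit_hierarchy(rows, emotions)
correct = sum(predict(models, x) == e for x, e in zip(rows, emotions))
print(f"training accuracy: {correct / len(rows):.3f}")
```

Because the synthetic clusters are well separated, this toy example says nothing about the accuracies reported in the abstract. It only shows how splitting the features by category lets each node of a hierarchy train on the subset most relevant to its decision, such as a time-frequency feature at the happiness versus anger node.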

Keywords: Smart phones, emotional speech recognition, social networks, support vector machines, time-frequency parameter, Mel-scale frequency cepstral coefficients (MFCC).

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1072525
