Speech Detection Model Based on Deep Neural Network Classifier for Speech Emotions Recognition
Authors: A. Shoiynbek, D. Kuanyshbay, P. Menezes, A. Bekarystankyzy, A. Mukhametzhanov, T. Shoiynbek
Abstract:
Speech emotion recognition (SER) has received increasing research interest in recent years. It is common practice to use emotional speech collected under controlled conditions, recorded by actors who imitate and artificially produce emotions in front of a microphone. Three issues arise from that approach: the emotions are not natural, so machines learn to recognize fake rather than genuine emotions; the recordings are limited in quantity and poor in speaking variety; and SER is partly language-dependent, so researchers starting work on SER must first find a good emotional database in their language. This paper proposes an approach to building an automatic tool for extracting emotional speech based on facial emotion recognition and describes the sequence of actions the approach involves. One of the first tasks in that sequence is speech detection. The paper gives a detailed description of a speech detection model based on a fully connected deep neural network for Kazakh and Russian. Although the reported results concern Kazakh and Russian, the described process is suitable for any language. To assess the working capacity of the developed model, speech detection and extraction were analyzed on real-world recordings.
Keywords: Deep neural networks, DNN, speech detection, speech emotion recognition, SER, Mel-frequency cepstral coefficients, collecting speech emotion corpus, collecting speech emotion dataset, Kazakh speech dataset.
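As a rough illustration of the pipeline the abstract outlines, the sketch below shows how per-frame MFCC features might feed a fully connected speech/non-speech classifier. This is a minimal sketch, not the authors' implementation: the layer sizes, dropout rate, learning rate, and the `mfcc_frames`/`build_model` helpers are illustrative assumptions, and `librosa` plus TensorFlow stand in for whatever toolchain the paper actually used.

```python
# Minimal sketch of a frame-level speech/non-speech classifier in the spirit
# of the abstract: per-frame MFCC features fed to a fully connected deep
# neural network. Layer sizes, dropout rate, and learning rate are
# illustrative assumptions, not values taken from the paper.
import librosa
import tensorflow as tf

N_MFCC = 13  # assumed number of MFCC coefficients per frame

def mfcc_frames(path, sr=16000):
    """Load an audio file and return per-frame MFCC vectors, shape (frames, N_MFCC)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    return mfcc.T

def build_model():
    """Fully connected network: ReLU hidden layers, dropout, sigmoid speech score."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_MFCC,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(frame contains speech)
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage sketch: stack MFCC frames from labeled speech/non-speech audio into a
# feature matrix, fit the model, then threshold per-frame scores (e.g. > 0.5)
# to cut speech segments out of longer recordings. The features and network
# are language-agnostic, matching the abstract's claim that the process
# suits any language.
```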