Improved Text-Independent Speaker Identification using Fused MFCC and IMFCC Feature Sets based on Gaussian Filter

Authors: Sandipan Chakroborty, Goutam Saha

Abstract:

A state-of-the-art Speaker Identification (SI) system requires a robust feature extraction unit followed by a speaker modeling scheme that gives a generalized representation of these features. Over the years, Mel-Frequency Cepstral Coefficients (MFCC), modeled on the human auditory system, have been used as a standard acoustic feature set for speech-related applications. In a recent contribution, the authors showed that Inverted Mel-Frequency Cepstral Coefficients (IMFCC) are a useful feature set for SI because they capture complementary information present in the high-frequency region. This paper introduces a Gaussian-shaped filter (GF) in place of the typical triangular-shaped bins when calculating MFCC and IMFCC. The objective is to introduce a higher amount of correlation between subband outputs. The performance of both MFCC and IMFCC improves with GF over the conventional triangular filter (TF) based implementation, individually as well as in combination. With a Gaussian Mixture Model (GMM) as the speaker modeling paradigm, the performance of the proposed GF-based MFCC and IMFCC in individual and fused modes has been verified on two standard databases, YOHO (microphone speech) and POLYCOST (telephone speech), each of which has more than 130 speakers.
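
The difference between triangular and Gaussian-shaped bins can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the mel-scale mapping, the filter count, the FFT size, the width parameter `alpha` (with sigma taken as the upper half-bandwidth divided by `alpha`), and the construction of the inverted bank by flipping both axes are all assumptions made for illustration. The overlap between filter responses is used here only as a rough proxy for correlation between subband outputs.

```python
import numpy as np

def hz_to_mel(f):
    # A common mel-scale formula (assumed; the abstract does not specify one)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def filter_bank(n_filters=20, n_fft=512, sample_rate=8000,
                shape="triangular", alpha=2.0):
    """Return an (n_filters, n_fft // 2 + 1) mel-spaced filter bank.

    shape: "triangular" or "gaussian".
    alpha: hypothetical width control for the Gaussian case,
           sigma = (upper half-bandwidth) / alpha.
    """
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # n_filters + 2 boundary points, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0),
                                  hz_to_mel(sample_rate / 2.0),
                                  n_filters + 2))
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, centre, hi = edges[i], edges[i + 1], edges[i + 2]
        if shape == "triangular":
            rising = (freqs - lo) / (centre - lo)
            falling = (hi - freqs) / (hi - centre)
            # Zero outside [lo, hi], peak of 1 at the centre frequency
            bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
        elif shape == "gaussian":
            sigma = (hi - centre) / alpha
            # Smooth tails extend beyond the neighbouring centre frequencies
            bank[i] = np.exp(-0.5 * ((freqs - centre) / sigma) ** 2)
        else:
            raise ValueError("shape must be 'triangular' or 'gaussian'")
    return bank

def inverted(bank):
    # Assumed construction of an inverted bank: reverse the filter order
    # and the frequency axis, so resolution is finest at high frequencies
    return bank[::-1, ::-1]

def overlap(bank, lag=1):
    # Mean cosine similarity between filters that are `lag` positions apart
    a, b = bank[:-lag], bank[lag:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

if __name__ == "__main__":
    tf = filter_bank(shape="triangular")
    gf = filter_bank(shape="gaussian")
    print("adjacent overlap   TF:", overlap(tf, 1), " GF:", overlap(gf, 1))
    print("two-apart overlap  TF:", overlap(tf, 2), " GF:", overlap(gf, 2))
    print("inverted GF bank shape:", inverted(gf).shape)
```

In this sketch a triangular filter has no overlap with the filter two positions away, whereas a Gaussian filter does, which is one way to see how the smoother shape could couple neighbouring subbands. Cepstral coefficients would then be obtained in the usual way by applying a DCT to the log filter-bank energies.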

Keywords: Gaussian Filter, Triangular Filter, Subbands, Correlation, MFCC, IMFCC, GMM.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1073555

