{"title":"Improved Closed Set Text-Independent Speaker Identification by Combining MFCC with Evidence from Flipped Filter Banks","authors":"Sandipan Chakroborty, Anindya Roy, Goutam Saha","volume":23,"journal":"International Journal of Electronics and Communication Engineering","pagesStart":2554,"pagesEnd":2562,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/3580","abstract":"
A state-of-the-art Speaker Identification (SI) system requires a robust feature extraction unit followed by a speaker modeling scheme for generalized representation of these features. Over the years, Mel-Frequency Cepstral Coefficients (MFCC), modeled on the human auditory system, have been used as a standard acoustic feature set for SI applications. However, owing to the structure of the Mel filter bank, MFCC captures vocal tract characteristics more effectively in the lower frequency regions. This paper proposes a new set of features using a complementary filter bank structure which improves the distinguishability of speaker-specific cues present in the higher frequency zone. Unlike high-level features that are difficult to extract, the proposed feature set involves little computational burden during the extraction process. When combined with MFCC via a parallel implementation of speaker models, the proposed feature set significantly outperforms the baseline MFCC. This proposition is validated by experiments conducted on two different kinds of public databases, namely YOHO (microphone speech) and POLYCOST (telephone speech), with Gaussian Mixture Models (GMM) as the classifier for various model orders.","references":"[1] J. P. Campbell, Jr., \"Speaker Recognition: A Tutorial\", Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, Sept. 1997.\r\n[2] S. B. Davis and P. Mermelstein, \"Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences\", IEEE Trans. on ASSP, vol. ASSP-28, no. 4, pp. 357-365, Aug. 1980.\r\n[3] R. Vergin, D. O'Shaughnessy and A. Farhat, \"Generalized Mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition\", IEEE Trans. on ASSP, vol. 7, no. 5, pp. 525-532, Sept. 1999.\r\n[4] B. Gold and N. Morgan, Speech and Audio Signal Processing, Part IV, Chap. 14, pp. 189-203, John Wiley & Sons, 2002.\r\n[5] U. G. Goldstein, \"Speaker identifying features based on formant tracks\", J. Acoust. Soc. Am., vol. 59, no. 1, pp. 176-182, Jan. 1976.\r\n[6] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Chap. 2, pp. 11-65, Pearson Education, First Indian Reprint, 2003.\r\n[7] D. J. Mashao and M. Skosan, \"Combining Classifier Decisions for Robust Speaker Identification\", Pattern Recognition, vol. 39, pp. 147-155, 2006.\r\n[8] F. Zheng, G. Zhang and Z. Song, \"Comparison of different implementations of MFCC\", J. Computer Science & Technology, vol. 16, no. 6, pp. 582-589, Sept. 2001.\r\n[9] T. Ganchev, N. Fakotakis and G. Kokkinakis, \"Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task\", Proc. of SPECOM 2005, Patras, Greece, October 17-19, 2005, vol. 1, pp. 191-194.\r\n[10] M. Faundez-Zanuy and E. Monte-Moreno, \"State-of-the-art in speaker recognition\", IEEE Aerospace and Electronic Systems Magazine, vol. 20, no. 5, pp. 7-12, Mar. 2005.\r\n[11] B. Yegnanarayana, S. R. M. Prasanna, J. M. Zachariah and C. S. Gupta, \"Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system\", IEEE Trans. Speech and Audio Processing, vol. 13, no. 4, pp. 575-582, July 2005.\r\n[12] K. Sri Rama Murty and B. Yegnanarayana, \"Combining evidence from residual phase and MFCC features for speaker recognition\", IEEE Signal Processing Letters, vol. 13, no. 1, pp. 52-55, Jan. 2006.\r\n[13] S. R. Mahadeva Prasanna, Cheedella S. Gupta and B. Yegnanarayana, \"Extraction of speaker-specific excitation information from linear prediction residual of speech\", Speech Communication, vol. 48, no. 10, pp. 1243-1261, Oct. 2006.
[14] D. Reynolds and R. Rose, \"Robust text-independent speaker identification using Gaussian mixture speaker models\", IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72-83, Jan. 1995.\r\n[15] D. O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, New York, 1987.\r\n[16] J. Kittler, M. Hatef, R. Duin and J. Matas, \"On combining classifiers\", IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 226-239, 1998.\r\n[17] D. Garcia-Romero, J. Fierrez-Aguilar, J. Gonzalez-Rodriguez and J. Ortega-Garcia, \"Using quality measures for multilevel speaker recognition\", Computer Speech and Language, vol. 20, issues 2-3, pp. 192-209, Apr. 2006.\r\n[18] J. Campbell, \"Testing with the YOHO CD-ROM voice verification corpus\", Proc. ICASSP 1995, vol. 1, pp. 341-344.\r\n[19] D. Petrovska et al., \"POLYCOST: A Telephone-Speech Database for Speaker Recognition\", RLA2C, Avignon, France, April 20-23, 1998, pp. 211-214.\r\n[20] H. Melin and J. Lindberg, \"Guidelines for experiments on the POLYCOST database\", in Proceedings of a COST 250 Workshop on Application of Speaker Recognition Techniques in Telephony, pp. 59-69, Vigo, Spain, November 1996.","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 23, 2008"}
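To make the feature-extraction idea in the abstract concrete, the following is a minimal Python sketch (not taken from the paper) of a Mel filter bank and a complementary "flipped" filter bank whose triangular filters are dense at high frequencies, with cepstral coefficients obtained from a DCT over log filter-bank energies. All function names, parameter values, and the score-fusion weight are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch: Mel vs. "flipped" (high-frequency-dense) triangular filter banks,
# cepstral coefficients via DCT of log filter-bank energies, and a comment on
# score-level fusion of two parallel GMM speaker models. Illustrative only.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_edges(n_filt, sr):
    """Mel-spaced band edges: filters dense at low frequencies."""
    m = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filt + 2)
    return mel_to_hz(m)

def flipped_edges(n_filt, sr):
    """Complementary spacing: mirror the Mel edges about mid-band,
    making the filters dense at high frequencies instead."""
    return sr / 2.0 - mel_edges(n_filt, sr)[::-1]

def triangular_filterbank(edge_hz, n_fft, sr):
    """Build triangular filters from a monotone list of band-edge frequencies (Hz)."""
    bins = np.floor((n_fft + 1) * np.asarray(edge_hz) / sr).astype(int)
    fb = np.zeros((len(edge_hz) - 2, n_fft // 2 + 1))
    for i in range(1, len(edge_hz) - 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ctr):
            fb[i - 1, k] = (k - lo) / max(ctr - lo, 1)   # rising edge
        for k in range(ctr, hi):
            fb[i - 1, k] = (hi - k) / max(hi - ctr, 1)   # falling edge
    return fb

def cepstra(power_spec, fb, n_ceps=13):
    """Log filter-bank energies followed by an explicit type-II DCT."""
    energies = np.log(power_spec @ fb.T + 1e-10)
    n = fb.shape[0]
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return energies @ dct.T

if __name__ == "__main__":
    sr, n_fft, n_filt = 8000, 512, 20              # assumed analysis parameters
    frames = np.abs(np.fft.rfft(np.random.randn(50, n_fft), axis=1)) ** 2  # toy power spectra
    mfcc_like = cepstra(frames, triangular_filterbank(mel_edges(n_filt, sr), n_fft, sr))
    flipped_like = cepstra(frames, triangular_filterbank(flipped_edges(n_filt, sr), n_fft, sr))
    print(mfcc_like.shape, flipped_like.shape)     # (50, 13) each
    # Parallel combination (sketch): score a test utterance against each speaker's
    # MFCC-GMM and flipped-feature GMM, then fuse the log-likelihoods, e.g.
    # combined = w * loglik_mfcc + (1 - w) * loglik_flipped, and pick the argmax
    # over speakers; the weight w would be tuned on held-out data.
```

In such a two-stream setup, each enrolled speaker has two GMMs (one per feature stream), and identification uses the fused score; the weighting scheme shown in the final comment is an assumed placeholder rather than the paper's reported combination rule.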