Speaker Identification by Joint Statistical Characterization in the Log Gabor Wavelet Domain

Authors: Suman Senapati, Goutam Saha

Abstract:

Real-world Speaker Identification (SI) applications differ from ideal or laboratory conditions: perturbations cause a mismatch between the training and testing environments and degrade performance drastically. Many strategies have been adopted to cope with acoustic degradation; the wavelet-based Bayesian marginal model is one of them. However, Bayesian marginal models cannot capture the statistical dependencies between different wavelet scales. Simple nonlinear estimators for wavelet-based denoising assume that the wavelet coefficients at different scales are independent, whereas in practice the coefficients show significant inter-scale dependency. This paper exploits this inter-scale dependency by modeling it with a Circularly Symmetric Probability Density Function (CS-PDF) related to the family of Spherically Invariant Random Processes (SIRPs) in the Log Gabor Wavelet (LGW) domain, and derives the corresponding joint shrinkage estimator using the Maximum a Posteriori (MAP) criterion. Based on these, a framework is proposed to denoise speech signals for automatic speaker identification. The robustness of the proposed framework is tested on a Text-Independent Speaker Identification task with 100 speakers from the POLYCOST database and 100 speakers from the YOHO speech database in three different noise environments. Experimental results show that the proposed estimator yields a larger improvement in identification accuracy than other estimators with the popular Gaussian Mixture Model (GMM) based speaker model and Mel-Frequency Cepstral Coefficient (MFCC) features.
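
The abstract does not state the estimator itself. As a rough illustration of what a joint (inter-scale) MAP shrinkage rule looks like, the sketch below implements the well-known bivariate shrinkage rule of Sendur and Selesnick, which follows from one particular circularly symmetric prior on a coefficient and its coarser-scale parent. It is an assumption chosen for illustration and is not the estimator derived in this paper; the function name and the parameters `sigma_n` and `sigma` are hypothetical.

```python
import numpy as np

def bivariate_shrink(child, parent, sigma_n, sigma):
    """Sendur-Selesnick bivariate shrinkage (illustrative, not this paper's rule).

    child   : noisy coefficients at the current scale
    parent  : noisy coefficients at the next coarser scale, same shape
    sigma_n : noise standard deviation (assumed known or estimated elsewhere)
    sigma   : standard deviation of the clean coefficients
    """
    child = np.asarray(child, dtype=float)
    parent = np.asarray(parent, dtype=float)
    # Joint magnitude of each (child, parent) pair.
    r = np.sqrt(child**2 + parent**2)
    # Threshold given by the MAP solution under the circularly symmetric prior.
    t = np.sqrt(3.0) * sigma_n**2 / sigma
    # Soft-threshold the joint magnitude; guard against division by zero.
    gain = np.maximum(r - t, 0.0) / np.maximum(r, np.finfo(float).eps)
    return gain * child
```

Because the gain depends on the joint magnitude, a small coefficient with a large parent is attenuated rather than zeroed, which is the behavior a marginal (per-coefficient) estimator cannot reproduce.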

Keywords: Speaker Identification, Log Gabor Wavelet, Bayesian Bivariate Estimator, Circularly Symmetric Probability Density Function, SIRP.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1077989

