Speaker Identification by Atomic Decomposition of Learned Features Using Computational Auditory Scene Analysis Principles in Noisy Environments
Speaker recognition is performed in high Additive White Gaussian Noise (AWGN) environments using principles of Computational Auditory Scene Analysis (CASA). CASA methods often classify sounds from images in the time-frequency (T-F) plane, using spectrograms or cochleagrams as the image. In this paper, atomic decomposition implemented by matching pursuit transforms time-series speech signals into the T-F plane. The atomic decomposition creates a sparsely populated T-F vector in “weight space,” where each populated T-F position contains an amplitude weight. The weight space vector, together with the atomic dictionary, represents a denoised, compressed version of the original signal. The arrangement of the atomic indices in the T-F vector is used for classification. Unsupervised feature learning implemented by a sparse autoencoder learns a single dictionary of basis features from a collection of envelope samples from all speakers. The approach is demonstrated using pairs of speakers from the TIMIT data set. Pairs of speakers are selected randomly from a single district. Each speaker has 10 sentences: two are used for training and eight for testing. Atomic index probabilities are created for each training sentence and for each test sentence. Classification is performed by finding the lowest Euclidean distance between the probabilities from the training sentences and those from the test sentences. Training is done at a 30 dB Signal-to-Noise Ratio (SNR). Testing is performed at SNRs of 0 dB, 5 dB, 10 dB, and 30 dB. The algorithm has a baseline classification accuracy of ~93% averaged over 10 pairs of speakers from the TIMIT data set. The baseline accuracy is attributable to the short sequences of training and test data as well as the overall simplicity of the classification algorithm. The accuracy is not affected by AWGN: the algorithm still achieves ~93% accuracy at 0 dB SNR.
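The classification chain described in the abstract (matching pursuit, atomic index probabilities, nearest Euclidean distance) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a small dictionary of unit-norm cosine atoms as a stand-in for the dictionary learned by the sparse autoencoder, synthetic frames in place of TIMIT speech, and no search over time shifts. All function names (`add_awgn`, `cosine_dictionary`, `matching_pursuit`, `index_probabilities`, `classify`, `synth_frames`) are hypothetical.

```python
import math
import random


def add_awgn(signal, snr_db, rng):
    """Return the signal plus white Gaussian noise at the requested SNR (dB)."""
    power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(power / 10 ** (snr_db / 10))
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def cosine_dictionary(n_atoms, length):
    """Assumed stand-in for the learned dictionary: unit-norm cosine atoms."""
    atoms = []
    for k in range(1, n_atoms + 1):
        atom = [math.cos(2 * math.pi * k * n / length) for n in range(length)]
        norm = math.sqrt(sum(a * a for a in atom))
        atoms.append([a / norm for a in atom])
    return atoms


def matching_pursuit(frame, dictionary, n_iter):
    """Greedy matching pursuit: return a list of (atom_index, weight) pairs."""
    residual = list(frame)
    picks = []
    for _ in range(n_iter):
        # Correlate the residual with every atom and keep the strongest match.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(corr[k]))
        picks.append((best, corr[best]))
        # Subtract the selected atom's contribution from the residual.
        residual = [r - corr[best] * a for r, a in zip(residual, dictionary[best])]
    return picks


def index_probabilities(frames, dictionary, n_iter):
    """Relative frequency with which each atom index is selected."""
    counts = [0] * len(dictionary)
    for frame in frames:
        for idx, _ in matching_pursuit(frame, dictionary, n_iter):
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def classify(test_probs, train_probs):
    """Return the label whose training probabilities are nearest (Euclidean)."""
    return min(train_probs, key=lambda s: math.dist(test_probs, train_probs[s]))


def synth_frames(atom_ids, dictionary, n_frames, rng):
    """Toy 'speaker': each frame is a random mix of that speaker's atoms."""
    frames = []
    for _ in range(n_frames):
        frame = [0.0] * len(dictionary[0])
        for k in atom_ids:
            w = rng.uniform(0.5, 1.5)
            frame = [f + w * a for f, a in zip(frame, dictionary[k])]
        frames.append(frame)
    return frames


rng = random.Random(0)
D = cosine_dictionary(4, 32)
speakers = {"A": [0, 1], "B": [2, 3]}  # toy speakers, not TIMIT data

# Train at 30 dB SNR, as in the abstract.
train = {}
for name, atom_ids in speakers.items():
    frames = [add_awgn(f, 30, rng) for f in synth_frames(atom_ids, D, 20, rng)]
    train[name] = index_probabilities(frames, D, 2)

# Test speaker "A" at 0 dB SNR.
test_frames = [add_awgn(f, 0, rng) for f in synth_frames(speakers["A"], D, 20, rng)]
label = classify(index_probabilities(test_frames, D, 2), train)
print(label)
```

In this toy setting the classifier uses only which atoms are selected, not their weights, which is one plausible reading of why the index statistics stay stable as noise perturbs the amplitudes. The real system would replace the cosine atoms with the learned dictionary and the synthetic frames with speech.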
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1124331