Blind Speech Separation Using SRP-PHAT Localization and Optimal Beamformer in Two-Speaker Environments

Authors: Hai Quang Hong Dam, Hai Ho, Minh Hoang Le Ngo

Abstract:

This paper investigates the problem of blind speech separation from a speech mixture of two speakers. A voice activity detector based on the Steered Response Power with Phase Transform (SRP-PHAT) is presented for detecting the activity of the speech sources, and the desired speech signals are then extracted from the mixture by an optimal beamformer. To evaluate the effectiveness of the algorithm, a simulation using real speech recordings was performed in a double-talk situation where both speakers are active at all times. The evaluations show that the proposed blind speech separation algorithm offers a good interference suppression level whilst maintaining a low distortion level of the desired signal.

Keywords: Blind speech separation, voice activity detector, SRP-PHAT, optimal beamformer.
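
The abstract describes a two-stage pipeline: an SRP-PHAT-based voice activity detector first identifies which speaker dominates each time frame, and an optimal beamformer then steers toward that speaker to extract the desired signal. The following is a minimal sketch of those two ingredients, not the authors' implementation: the two-microphone geometry, frame parameters, and the use of MVDR weights as the optimal beamformer are assumptions made purely for illustration.

```python
# Minimal sketch (assumptions, not the paper's code): SRP-PHAT localization over
# candidate directions, plus MVDR weights standing in for the "optimal beamformer".
import numpy as np

FS = 16000          # sampling rate in Hz (assumed)
NFFT = 512          # FFT length per frame (assumed)
C = 343.0           # speed of sound in m/s
MIC_POS = np.array([[0.00, 0.00],
                    [0.05, 0.00]])   # hypothetical 2-mic array, 5 cm apart


def srp_phat_map(frames, candidate_angles_deg):
    """SRP-PHAT power for each candidate direction of arrival.

    frames: (num_mics, NFFT) array holding one windowed time frame per mic.
    """
    X = np.fft.rfft(frames, NFFT, axis=1)        # per-mic spectra, shape (M, F)
    freqs = np.fft.rfftfreq(NFFT, 1.0 / FS)      # bin frequencies, shape (F,)
    powers = []
    for ang in np.deg2rad(candidate_angles_deg):
        direction = np.array([np.cos(ang), np.sin(ang)])
        delays = MIC_POS @ direction / C         # per-mic propagation delays (s)
        power = 0.0
        for i in range(len(MIC_POS)):            # sum PHAT-weighted cross-spectra
            for j in range(i + 1, len(MIC_POS)): # over all microphone pairs
                cross = X[i] * np.conj(X[j])
                cross /= np.abs(cross) + 1e-12   # PHAT weighting (whiten magnitude)
                tau = delays[i] - delays[j]
                power += np.real(np.sum(cross * np.exp(2j * np.pi * freqs * tau)))
        powers.append(power)
    return np.array(powers)      # the argmax indicates the dominant speaker's DOA


def mvdr_weights(steering, noise_cov):
    """MVDR weights for one frequency bin: minimize interference-plus-noise
    power subject to a distortionless response toward the steering vector."""
    inv_r = np.linalg.pinv(noise_cov)
    num = inv_r @ steering
    return num / (np.conj(steering) @ num)


# Hypothetical usage: localize the dominant speaker in one frame, then build a
# beamformer that passes that direction without distortion.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((2, NFFT))                # stand-in for real audio
    angles = np.arange(0, 181, 5)                         # candidate DOAs in degrees
    doa = angles[np.argmax(srp_phat_map(frame, angles))]
    print("estimated direction of arrival:", doa, "degrees")

    direction = np.array([np.cos(np.deg2rad(doa)), np.sin(np.deg2rad(doa))])
    d = np.exp(-2j * np.pi * 1000.0 * (MIC_POS @ direction) / C)  # steering at 1 kHz
    w = mvdr_weights(d, np.eye(2))
    print("response toward DOA:", np.abs(np.conj(w) @ d))  # 1.0 -> distortionless
```

In this sketch the per-frame SRP-PHAT peak plays the role of the voice activity information, telling the beamformer which of the two speakers to treat as the desired source in that frame.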

Digital Object Identifier (DOI): https://doi.org/10.5281/zenodo.1126231

