Automatic Segmentation of the Clean Speech Signal
Authors: M. A. Ben Messaoud, A. Bouzid, N. Ellouze
Abstract:
Speech segmentation is the detection of change points in order to partition an input speech signal into regions, each of which corresponds to a single speaker. In this paper, we apply two features based on the multi-scale product (MP) of clean speech, namely the spectral centroid of the MP and the zero crossings rate of the MP. We focus on multi-scale product analysis as a tool for extracting segments. The MP is formed by multiplying the wavelet transform coefficients (WTC) of the speech signal across scales. We evaluated our method on the Keele database. The results indicate that the two features can locate word boundaries and extract the segments of clean speech.
Keywords: Speech segmentation, Multi-scale product, Spectral centroid, Zero crossings rate.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1099502
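The two features named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the mother wavelet, the scales, or the framing, so everything below is an assumption made for illustration: a first-derivative-of-Gaussian wavelet, the three scales (1, 2, 4) in samples, a Hann window, and the hypothetical function names `dog_wavelet`, `multiscale_product`, and `frame_features`. It is not the authors' implementation.

```python
import numpy as np

def dog_wavelet(scale, width=4.0):
    """First derivative of a Gaussian sampled at `scale` (assumed wavelet)."""
    half = int(np.ceil(width * scale))
    t = np.arange(-half, half + 1, dtype=float)
    psi = -t * np.exp(-t**2 / (2.0 * scale**2))
    return psi / np.sum(np.abs(psi))  # L1 normalisation, an arbitrary choice

def multiscale_product(x, scales=(1, 2, 4)):
    """Pointwise product of wavelet coefficients over the given scales."""
    x = np.asarray(x, dtype=float)  # x should be longer than the widest kernel
    mp = np.ones_like(x)
    for s in scales:
        mp *= np.convolve(x, dog_wavelet(s), mode="same")
    return mp

def frame_features(mp, fs, frame_len=256, hop=128):
    """Per-frame spectral centroid (Hz) and zero crossings rate of the MP."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    centroids, zcrs = [], []
    for start in range(0, len(mp) - frame_len + 1, hop):
        frame = mp[start:start + frame_len]
        mag = np.abs(np.fft.rfft(frame * window))
        total = mag.sum()
        # Magnitude-weighted mean frequency; 0 for an all-zero frame.
        centroids.append(float((freqs * mag).sum() / total) if total > 0 else 0.0)
        # Fraction of adjacent sample pairs whose sign differs.
        nonneg = frame >= 0
        zcrs.append(np.count_nonzero(nonneg[1:] != nonneg[:-1]) / (frame_len - 1))
    return np.array(centroids), np.array(zcrs)
```

For a signal `x` sampled at `fs` Hz, `frame_features(multiscale_product(x), fs)` returns one centroid and one zero crossings rate per frame. How these two feature tracks are thresholded or combined to place segment boundaries is not stated in the abstract and is left out of the sketch.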