[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5391/IJFIS.2014.14.4.240

Text-independent Speaker Identification Using Soft Bag-of-Words Feature Representation

Jiang, Shuangshuang (Multimedia Research Lab, CECS Dept., University of Louisville)
Frigui, Hichem (Multimedia Research Lab, CECS Dept., University of Louisville)
Calhoun, Aaron W. (Pediatrics Dept., University of Louisville)

Publication Information

International Journal of Fuzzy Logic and Intelligent Systems / v.14, no.4, 2014, pp. 240-248

Abstract

We present a robust speaker identification algorithm that uses novel features based on a soft bag-of-words representation and a simple Naive Bayes classifier. The bag-of-words (BoW) histogram feature descriptor is typically constructed by identifying and summarizing representative prototypes from low-level spectral features extracted from training data. In this paper, we define a generalization of the standard BoW. In particular, we define three types of BoW that are based on crisp voting, fuzzy memberships, and possibilistic memberships. We analyze our mapping with three common classifiers: the Naive Bayes classifier (NB), the K-nearest neighbor classifier (KNN), and support vector machines (SVM). The proposed algorithms are evaluated using large datasets that simulate medical crises. We show that the proposed soft bag-of-words feature representation achieves a significant improvement over state-of-the-art methods.
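The three BoW variants named in the abstract can be illustrated with a minimal sketch. The abstract does not give the membership formulas, so this example assumes the standard fuzzy c-means membership and the possibilistic c-means typicality function; the paper's own definitions may differ. The function names, the fuzzifier `m`, the per-prototype scales `etas`, and the toy prototypes are all hypothetical. The prototypes are assumed to be given already, for example as cluster centers of training features.

```python
# Sketch only: membership formulas are assumed (FCM / PCM style),
# not taken from the paper. All names and parameters are illustrative.

def sq_dist(x, c):
    """Squared Euclidean distance between a feature vector and a prototype."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def crisp_votes(x, prototypes):
    """Crisp voting: a full vote for the nearest prototype, zero elsewhere."""
    d = [sq_dist(x, c) for c in prototypes]
    k = d.index(min(d))
    return [1.0 if j == k else 0.0 for j in range(len(prototypes))]

def fuzzy_memberships(x, prototypes, m=2.0, eps=1e-12):
    """Assumed FCM-style memberships; they sum to 1 across prototypes."""
    p = 1.0 / (m - 1.0)
    inv = [(1.0 / (sq_dist(x, c) + eps)) ** p for c in prototypes]
    s = sum(inv)
    return [v / s for v in inv]

def possibilistic_memberships(x, prototypes, etas, m=2.0):
    """Assumed PCM-style typicalities; each lies in (0, 1], no sum constraint."""
    p = 1.0 / (m - 1.0)
    return [1.0 / (1.0 + (sq_dist(x, c) / eta) ** p)
            for c, eta in zip(prototypes, etas)]

def soft_bow(frames, prototypes, weight_fn):
    """Accumulate per-frame weights into a histogram normalized to sum to 1."""
    hist = [0.0] * len(prototypes)
    for x in frames:
        for j, w in enumerate(weight_fn(x, prototypes)):
            hist[j] += w
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

# Toy data: two prototypes and three 2-D "frames" (hypothetical values).
prototypes = [[0.0, 0.0], [4.0, 0.0]]
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 0.0]]

crisp_hist = soft_bow(frames, prototypes, crisp_votes)
fuzzy_hist = soft_bow(frames, prototypes, fuzzy_memberships)
poss_hist = soft_bow(
    frames, prototypes,
    lambda x, P: possibilistic_memberships(x, P, etas=[1.0, 1.0]),
)
```

Each histogram could then serve as the feature vector passed to a classifier such as Naive Bayes, KNN, or an SVM. The only difference between the three variants is how much each frame contributes to each bin.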

Keywords

Speaker identification; Clustering; Bag-of-Words (BoW) feature representation; Fuzzy membership; Possibilistic membership; Naive Bayes classifier
