[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.13089/JKIISC.2019.29.6.1393

Speaker Verification Model Using Short-Time Fourier Transform and Recurrent Neural Network

Kim, Min-seo (Graduate School of Information Security, Korea University)
Moon, Jong-sub (Graduate School of Information Security, Korea University)

Publication Information

Journal of the Korea Institute of Information Security & Cryptology, v.29, no.6, 2019, pp. 1393-1401

Abstract

As voice authentication is built into more systems, accurately authenticating speakers is becoming increasingly important, and a variety of speaker verification models have been suggested. In this paper, we propose a new speaker verification method based on the Short-Time Fourier Transform (STFT). Unlike the existing Mel-Frequency Cepstral Coefficient (MFCC) extraction method, we use a window function with an overlap parameter of around 66.1%. The speaker's speech characteristics, including their temporal characteristics, are then learned by a deep learning model, a Recurrent Neural Network (RNN) with LSTM cells. The accuracy of the proposed model is around 92.8%, approximately 5.5% higher than that of the existing speaker verification model.
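To make the windowing step concrete, the following is a minimal sketch of an STFT magnitude computation with an overlap fraction of 0.661, not the authors' implementation. The abstract gives only the overlap figure, so the Hann window, the 64-sample frame length, the 8 kHz sampling rate, the test tone, and all function names here are assumptions chosen for illustration. A naive DFT is used so that the sketch needs only the standard library.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window (assumed; the abstract does not name the window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def hop_from_overlap(frame_len, overlap):
    """Hop size in samples so that consecutive frames share about `overlap` of their samples."""
    return max(1, round(frame_len * (1.0 - overlap)))


def stft_magnitude(signal, frame_len, overlap):
    """Return one magnitude spectrum per frame, using a naive DFT for illustration."""
    hop = hop_from_overlap(frame_len, overlap)
    window = hann_window(frame_len)
    n_bins = frame_len // 2 + 1  # non-negative frequency bins of a real signal
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


# Hypothetical settings: 8 kHz sampling, 64-sample frames, a 1 kHz test tone.
SAMPLE_RATE = 8000
FRAME_LEN = 64
OVERLAP = 0.661  # overlap fraction quoted in the abstract

tone = [math.sin(2.0 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spectrogram = stft_magnitude(tone, FRAME_LEN, OVERLAP)
```

With these assumed settings, the hop is 22 samples, and each entry of `spectrogram` is the magnitude spectrum of one frame. In a pipeline like the one the abstract describes, such a sequence of frames would plausibly be the time-ordered input to the LSTM, one frame per time step.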

Keywords

Speaker verification; STFT; Deep Learning; Recurrent Neural Network (RNN)
