http://dx.doi.org/10.13089/JKIISC.2019.29.6.1393

Speaker Verification Model Using Short-Time Fourier Transform and Recurrent Neural Network  

Kim, Min-seo (Graduate School of Information Security, Korea University)
Moon, Jong-sub (Graduate School of Information Security, Korea University)
Abstract
Recently, as voice authentication functions are built into more systems, accurately authenticating speakers is becoming more important. Accordingly, various speaker verification models have been proposed. In this paper, we propose a new speaker verification method based on the Short-Time Fourier Transform (STFT). Unlike the existing Mel-Frequency Cepstral Coefficient (MFCC) extraction method, we use a window function with an overlap of about 66.1%. The speaker's speech characteristics, including their temporal characteristics, are then learned by a deep learning model, a Recurrent Neural Network (RNN) with LSTM cells. The proposed model achieves an accuracy of about 92.8%, approximately 5.5% higher than that of the existing speaker verification model.
Keywords
Speaker verification; STFT; Deep Learning; Recurrent Neural Network (RNN)
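To make the windowing described in the abstract concrete, the following is a minimal sketch of a magnitude STFT whose frames overlap by about 66.1%. It is not the authors' implementation: the function name `stft_magnitude`, the Hann window, the 1000-sample frame length, the 16 kHz sample rate, and the synthetic test tone are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, frame_len=1000, overlap=0.661):
    """Magnitude STFT with a Hann window.

    frame_len and the Hann window are assumed values; only the
    overlap ratio (about 66.1%) comes from the abstract.
    """
    # Hop size is the part of each frame that does NOT overlap the next one.
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    # Zero-pad very short signals so at least one full frame exists.
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One-sided FFT per frame; rows are time steps, columns are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)), hop

# Synthetic one-second 480 Hz tone at an assumed 16 kHz sample rate.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 480.0 * t)

spec, hop = stft_magnitude(tone)
print(spec.shape, hop)
```

Each row of `spec` is one time step, so the array has the (time, feature) layout that a recurrent model such as an LSTM typically consumes as a sequence.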