
Segment unit shuffling layer in deep neural networks for text-independent speaker verification


  • Jung-woo Heo (School of Computer Science, University of Seoul) ;
  • Hye-jin Shim (School of Computer Science, University of Seoul) ;
  • Ju-ho Kim (School of Computer Science, University of Seoul) ;
  • Ha-Jin Yu (School of Computer Science, University of Seoul)
  • Received : 2021.02.15
  • Accepted : 2021.03.15
  • Published : 2021.03.31

Abstract

Text-independent speaker verification requires extracting speaker embeddings that are independent of the spoken text in order to improve generalization performance. However, deep neural networks, which depend on their training data, may overfit to text information instead of learning speaker information when they repeatedly train on identical time series. In this paper, to prevent such overfitting, we propose a segment unit shuffling layer that divides the input layer or a hidden layer into segments along the time axis and randomly rearranges them, thereby shuffling the temporal order of the sequence. Because the segment unit shuffling layer can be applied not only to the input layer but also to hidden layers, it can serve as a hidden-layer generalization technique, which is known to be more effective than generalization applied at the input layer, and it can be used together with data augmentation. In addition, the degree of distortion can be controlled by adjusting the segment size. We observe that applying the proposed segment unit shuffling layer improves text-independent speaker verification performance over the baseline.

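The operation the abstract describes, splitting a feature sequence into fixed-size segments along the time axis and randomly reordering them, can be sketched as below. This is a minimal illustrative version, not the authors' implementation: the function name `segment_shuffle` and the NumPy formulation are assumptions, and a real system would apply the same idea to input features or hidden activations inside the network.

```python
import numpy as np

def segment_shuffle(features, segment_size, rng=None):
    """Illustrative sketch of a segment unit shuffling layer.

    Splits a (time, feature) array into segments of `segment_size`
    frames along the time axis and randomly reorders the segments,
    scrambling the temporal (text) order while keeping frame-level
    speaker cues intact. Larger `segment_size` means less distortion.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = features.shape[0]
    # Split at every segment_size-th frame; the last segment may be shorter.
    boundaries = np.arange(segment_size, n_frames, segment_size)
    segments = np.split(features, boundaries, axis=0)
    # Randomly permute the segment order and re-join along the time axis.
    order = rng.permutation(len(segments))
    return np.concatenate([segments[i] for i in order], axis=0)

# Example: 10 frames of 4-dim features, shuffled in 3-frame segments.
x = np.arange(40, dtype=float).reshape(10, 4)
y = segment_shuffle(x, segment_size=3, rng=np.random.default_rng(0))
assert y.shape == x.shape                          # time length preserved
assert sorted(map(tuple, y)) == sorted(map(tuple, x))  # same frames, new order
```

Because only the order of whole segments changes, frames inside each segment keep their local context; the segment size is the knob that trades temporal distortion against preserved short-term structure.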

