References
- D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 10, 19-41 (2000). https://doi.org/10.1006/dspr.1999.0361
- W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," Proc. ICASSP. 97-100 (2006).
- P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouche, "Factor analysis simplified speaker verification applications," Proc. ICASSP. 637-640 (2005).
- N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. on Audio, Speech, and Language Processing, 19, 788-798 (2011). https://doi.org/10.1109/TASL.2010.2064307
- D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," Proc. Interspeech, 249-252 (2011).
- E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," Proc. ICASSP. 4052-4056 (2014).
- D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," Proc. ICASSP. 5329-5333 (2018).
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. CVPR. 770-778 (2016).
- W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," Proc. Odyssey, 74-81 (2018).
- W. Cai, J. Chen, and M. Li, "Analysis of length normalization in end-to-end speaker verification system," Proc. Interspeech, 3618-3622 (2018).
- W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," Proc. ICASSP. 5791-5795 (2019).
- I. Kim, K. Kim, J. Kim, and C. Choi, "Deep representation using orthogonal decomposition and recombination for speaker verification," Proc. ICASSP. 6126-6130 (2019).
- Y. Jung, Y. Kim, H. Lim, Y. Choi, and H. Kim, "Spatial pyramid encoding with convex length normalization for text-independent speaker verification," Proc. Interspeech, 4030-4034 (2019).
- S. Seo, D. J. Rim, M. Lim, D. Lee, H. Park, J. Oh, C. Kim, and J. Kim, "Shortcut connections based deep speaker embeddings for end-to-end speaker verification system," Proc. Interspeech, 2928-2932 (2019).
- F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," Proc. CVPR. 3156-3164 (2017).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NeurIPS. 5998-6008 (2017).
- Z. Lin, M. Feng, C. N. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," Proc. ICLR (2017).
- K. Lee, X. Chen, G. Hua, H. Hu, and X. He, "Stacked cross attention for image-text matching," Proc. ECCV. 201-216 (2018).
- L. Bao, B. Ma, H. Chang, and X. Chen, "Masked graph attention network for person re-identification," Proc. CVPR. (2019).
- S. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "Endto- end attention based text-dependent speaker verification," Proc. SLT. 171-178 (2016).
- G. Bhattacharya, J. Alam, and P. Kenny, "Deep speaker embeddings for short-duration speaker verification," Proc. Interspeech, 1517-1521 (2017).
- F. R. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," Proc. ICASSP. 5359-5363 (2018).
- K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," Proc. Interspeech, 2252-2256 (2018).
- Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," Proc. Interspeech, 3573-3577 (2018).
- M. India, P. Safari, and J. Hernando, "Self multi-head attention for speaker recognition," Proc. Interspeech, 4305-4309 (2019).
- J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," Proc. Interspeech, 1086-1090 (2018).
- A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," Proc. Interspeech, 2616-2620 (2017).
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," Proc. NeurIPS. 8024-8035 (2019).