Funding
This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea in 2019 (NRF-2019S1A5A2A03045884).
References
- H. Hu, M. Xu, and W. Wu, "GMM supervector based SVM with spectral features for speech emotion recognition," Proc. ICASSP. 413-416 (2007).
- A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, and B. Schuller, "Deep neural networks for acoustic emotion recognition: Raising the benchmarks," Proc. ICASSP. 5688-5691 (2011).
- G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," Proc. ICASSP. 5200-5204 (2016).
- S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," Proc. ICASSP. 2227-2231 (2017).
- J. Kim, G. Englebienne, K. P. Truong, and V. Evers, "Towards speech emotion recognition 'in the wild' using aggregated corpora and deep multi-task learning," Proc. Interspeech. 1113-1117 (2017).
- S. Yoon, S. Byun, and K. Jung, "Multimodal speech emotion recognition using audio and text," Proc. SLT. 112-118 (2018).
- Z. Lu, L. Cao, Y. Zhang, C. Chiu, and J. Fan, "Speech sentiment analysis via pre-trained features from end-to-end ASR models," Proc. ICASSP. 7149-7153 (2020).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NIPS. 6000-6010 (2017).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," Proc. NAACL-HLT. 4171-4186 (2019).
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "Wav2vec 2.0: A framework for self-supervised learning of speech representations," Proc. NeurIPS. 12449-12460 (2020).
- C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, 42, 335-359 (2008). https://doi.org/10.1007/s10579-008-9076-6
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," Proc. ICASSP. 5206-5210 (2015).
- W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," Proc. ICASSP. 4960-4964 (2016).
- A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," Proc. ICASSP. 6645-6649 (2013).
- A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," Proc. ICML. 369-376 (2006).
- S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE JSTSP. 11, 1240-1253 (2017).
- T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," Proc. EMNLP. 66-71 (2018).
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," Proc. NeurIPS. 8024-8035 (2019).