Funding Information
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (MSIT) (no. 2022-0-00608; Development of artificial intelligence technology of multimodal interaction for empathetic and social conversations with humans).
References
- K. K. Bowden, S. Oraby, A. Misra, J. Wu, S. Lukin, and M. Walker, Data-driven dialogue systems for social agents, (8th International Workshop on Spoken Dialog Systems, PA, USA), 2017.
- P. Fung, D. Bertero, Y. Wan, A. Dey, R. H. Y. Chan, F. B. Siddique, Y. Yang, C.-S. Wu, and R. Lin, Towards empathetic human-robot interactions, (Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, Konya, Türkiye), 2016.
- S. Iwasaki, The Northridge earthquake conversations: the floor structure and the 'loop' sequence in Japanese conversation, J. Pragmat. 28 (1997), no. 6, 661-693. https://doi.org/10.1016/S0378-2166(97)00070-2
- M. Barange, S. Rasendrasoa, M. Bouabdelli, J. Saunier, and A. Pauchet, Impact of adaptive multimodal empathic behavior on the user interaction, (Proceedings of the 22nd ACM International Conference on Intelligent Virtual Agents, Faro, Portugal), 2022, pp. 1-8.
- L. Huang, L.-P. Morency, and J. Gratch, Virtual rapport 2.0, (Proceedings of the 10th ACM International Conference on Intelligent Virtual Agents, Reykjavik, Iceland), 2011, pp. 68-79.
- A. I. Adiba, T. Homma, and T. Miyoshi, Towards immediate backchannel generation using attention-based early prediction model, (Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Ontario, Canada), 2021, pp. 7408-7412.
- J. Y. Jang, S. Kim, M. Jung, S. Shin, and G. Gweon, BPM_MT: Enhanced backchannel prediction model using multi-task learning, (Proceedings of the Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic), 2021, pp. 3447-3452.
- D. Ortega, C.-Y. Li, and N. T. Vu, Oh, jeez! or uh-huh? A listener-aware backchannel predictor on ASR transcriptions, (Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain), 2020, pp. 8064-8068.
- R. Ruede, Backchannel prediction for conversational speech using recurrent neural networks, Bachelor's thesis, Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, 2017, pp. 1-52.
- A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena, Recurrent neural networks for driver activity anticipation via sensory-fusion architecture, (Proceedings of IEEE International Conference on Robotics and Automation, Stockholm, Sweden), 2016, pp. 3118-3125.
- T. Suzuki, H. Kataoka, Y. Aoki, and Y. Satoh, Anticipating traffic accidents with adaptive loss and large-scale incident DB, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA), 2018, pp. 3521-3529.
- S. Ruder, An overview of multi-task learning in deep neural networks, 2017. Available from: https://arxiv.org/abs/1706.05098 [last accessed August 2023].
- A. Graves, Sequence transduction with recurrent neural networks, (ICML Workshop on Representation Learning, Edinburgh, Scotland), 2012.
- A. Graves, A. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, (IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada), 2013. https://doi.org/10.1109/ICASSP.2013.6638947
- Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-Y. Chang, K. Rao, and A. Gruenstein, Streaming end-to-end speech recognition for mobile devices, (Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK), 2019.
- C.-C. Chiu and C. Raffel, Monotonic chunkwise attention, (Proceedings of International Conference on Learning Representations, Vancouver, Canada), 2018.
- J. Hou, S. Zhang, and L. Dai, Gaussian prediction based attention for online end-to-end speech recognition, (Proceedings of Annual Conference of the International Speech Communication Association, Stockholm, Sweden), 2017, pp. 3692-3696.
- N. Moritz, T. Hori, and J. L. Roux, Triggered attention for end-to-end speech recognition, (Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK), 2019. https://doi.org/10.1109/ICASSP.2019.8683510
- L. Dong, F. Wang, and B. Xu, Self-attention aligner: a latency-control end-to-end model for ASR using self-attention network and chunk-hopping, (Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK), 2019, pp. 5656-5660.
- E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, Transformer ASR with contextual block processing, (Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, Singapore), 2019, pp. 427-433.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, (Proceedings of the 31st International Conference on Neural Information Processing Systems, CA, USA), 2017, pp. 6000-6010.
- E. Tsunoo, Y. Kashiwagi, and S. Watanabe, Streaming transformer ASR with blockwise synchronous beam search, (Proceedings of IEEE Spoken Language Technology Workshop, Virtual), 2021, pp. 22-29.
- J. J. Godfrey and E. Holliman, Switchboard-1 release 2 LDC97S62, 1993. Available from: https://catalog.ldc.upenn.edu/LDC97S62 [last accessed August 2023].
- D. Jurafsky, R. Bates, N. Coccaro, R. Martin, M. Meteer, K. Ries, E. Shriberg, A. Stolcke, P. Taylor, and C. Van Ess-Dykema, Switchboard discourse language modeling project final report, (Johns Hopkins LVCSR Workshop-97), 1998.
- S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, Hybrid CTC/attention architecture for end-to-end speech recognition, IEEE J. Sel. Top. Signal Process. 11 (2017), no. 8, 1240-1253. https://doi.org/10.1109/JSTSP.2017.2763455
- R. Sennrich, B. Haddow, and A. Birch, Neural machine translation of rare words with subword units, (Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany), 2016, pp. 1715-1725.
- S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, and A. Renduchintala, ESPnet: end-to-end speech processing toolkit, (Proceedings of Annual Conference of the International Speech Communication Association, Hyderabad, India), 2018, pp. 2207-2211.
- A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic differentiation in PyTorch, (NIPS Autodiff Workshop, CA, USA), 2017.