Synthesis of Expressive Talking Heads from Speech with Recurrent Neural Network
Sakurai, Ryuhei (Ritsumeikan University)
Shimba, Taiki (Ritsumeikan University)
Yamazoe, Hirotake (Ritsumeikan University)
Lee, Joo-Ho (Ritsumeikan University)
[1] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion Recognition by Speech Signals," 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, pp. 125-128, 2003.
[2] Y. Pan, P. Shen, and L. Shen, "Speech emotion recognition using support vector machine," International Journal of Smart Home, vol. 6, no. 2, pp. 101-107, April, 2012.
[3] S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 1, pp. 52-59, Feb., 1986.
[4] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X.A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Microsoft Corporation, 1995.
[5] S. Imai, T. Kobayashi, K. Tokuda, T. Masuko, K. Koishida, S. Sako, and H. Zen, Speech Signal Processing Toolkit (SPTK), [Online], http://sp-tk.sourceforge.net/, Accessed: Feb. 14, 2018.
[6] I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, no. 2, pp. 135-164, Nov., 2004.
[7] J. Alabort-i-Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou, "Menpo: a comprehensive platform for parametric image alignment and visual deformable models," 22nd ACM International Conference on Multimedia, Orlando, FL, USA, pp. 679-682, 2014.
[8] D.W. Massaro, "Symbiotic value of an embodied agent in language learning," 37th Hawaii International Conference on System Sciences, Big Island, HI, USA, 2004, doi: 10.1109/HICSS.2004.1265333.
[9] B. Fan, L. Wang, F.K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM," International Conference on Acoustics, Speech, and Signal Processing, Brisbane, QLD, Australia, 2015, doi: 10.1109/ICASSP.2015.7178899.
[10] L. Wang and F.K. Soong, "HMM trajectory-guided sample selection for photo-realistic talking head," Multimedia Tools and Applications, vol. 74, no. 22, pp. 9849-9869, Nov., 2014.
[11] M. Schuster and K.K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, Nov., 1997.
[12] B.D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," 1981 DARPA Image Understanding Workshop, pp. 121-130, April, 1981.
[13] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, pp. 26-30, 2012.
[14] A. Karpathy, J. Johnson, and L. Fei-Fei, "Visualizing and understanding recurrent networks," arXiv:1506.02078, 2015.
[15] E. Cosatto and H.P. Graf, "Sample-based synthesis of photo-realistic talking heads," Computer Animation '98, Philadelphia, PA, USA, pp. 103-110, 1998.
[16] V. Wan, R. Blokland, N. Braunschweiler, L. Chen, B. Kolluru, J. Latorre, R. Maia, B. Stenger, K. Yanagisawa, Y. Stylianou, M. Akamine, M.J.F. Gales, and R. Cipolla, "Photo-Realistic Expressive Text to Talking Head Synthesis," 14th Annual Conference of the International Speech Communication Association, Lyon, France, pp. 2667-2669, 2013.
[17] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.
[18] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, May, 2009.
[19] S. Haq and P.J.B. Jackson, "Multimodal Emotion Recognition," in W. Wang (ed.), Machine Audition: Principles, Algorithms and Systems, Hershey, PA: IGI Global, 2011, pp. 398-423, doi: 10.4018/978-1-61520-919-4.ch017.
[20] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, no. 5, pp. 807-813, May, 2010.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, Jun., 2014.
[22] W. Han, L. Wang, F. Soong, and B. Yuan, "Improved minimum converted trajectory error training for real-time speech-to-lips conversion," 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012, doi: 10.1109/ICASSP.2012.6288921.
[23] D.W. Massaro, J. Beskow, M.M. Cohen, C.L. Fry, and T. Rodriguez, "Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks," Auditory-Visual Speech Processing, Santa Cruz, CA, USA, pp. 133-138, 1999.