References
- Arik, S., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., Sengupta, S., & Shoeybi, M. (2017a). Deep Voice: Real-time neural text-to-speech. Proceedings of the 34th International Conference on Machine Learning (pp. 195-204). Sydney, AU. 6-11 August, 2017.
- Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017b). Deep Voice 2: Multi-speaker neural text-to-speech. Advances in Neural Information Processing Systems 30 (pp. 2966-2974). Long Beach, CA. 4-9 December, 2017.
- Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. Retrieved from http://arxiv.org/abs/1409.0473 [Computing Research Repository] on January 9, 2018.
- Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 41-48). 14-18 June, 2009.
- Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Retrieved from http://arxiv.org/abs/1406.1078 [Computing Research Repository] on January 9, 2018.
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Retrieved from http://arxiv.org/abs/1412.3555 [Computing Research Repository] on January 9, 2018.
- Collins, J., Sohl-Dickstein, J., & Sussillo, D. (2017). Capacity and trainability in recurrent neural networks. Proceedings of the 5th International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=BydARw9ex on January 9, 2018.
- Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243. https://doi.org/10.1109/TASSP.1984.1164317
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778). 26 June-1 July, 2016.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 373-376). 7-10 May, 1996.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (pp. 448-456). 6-11 July, 2015.
- Kawahara, H. (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 1303-1306). 21-24 April, 1997.
- Lee, J., Cho, K., & Hofmann, T. (2016). Fully character-level neural machine translation without explicit segmentation. Retrieved from http://arxiv.org/abs/1610.03017 [Computing Research Repository] on January 9, 2018.
- Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877-1884.
- Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from http://arxiv.org/abs/1609.03499 [Computing Research Repository] on January 9, 2018.
- Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2017). Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. Retrieved from http://arxiv.org/abs/1710.07654 [Computing Research Repository] on January 9, 2018.
- Rabiner, L., & Schafer, R. (2011). Theory and applications of digital speech processing. New Jersey: Pearson.
- Raffel, C., Luong, M.-T., Liu, P., Weiss, R., & Eck, D. (2017). Online and linear-time attention by enforcing monotonic alignments. Proceedings of the 34th International Conference on Machine Learning (pp. 2837-2846). 6-11 August, 2017.
- Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Retrieved from http://arxiv.org/abs/1712.05884 [Computing Research Repository] on March 1, 2018.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
- Srivastava, R., Greff, K., & Schmidhuber, J. (2015). Highway networks. Retrieved from http://arxiv.org/abs/1505.00387 [Computing Research Repository] on January 9, 2018.
- Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (pp. 3104-3112). 8-13 December, 2014.
- Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., & Oura, K. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 1234-1252. https://doi.org/10.1109/JPROC.2013.2251852
- Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems 28 (pp. 2773-2781). 7-12 December, 2015.
- Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. (2017). Tacotron: Towards end-to-end speech synthesis. Retrieved from http://arxiv.org/abs/1703.10135 [Computing Research Repository] on January 9, 2018.
- Wu, Z., Watts, O., & King, S. (2016). Merlin: An open source neural network speech synthesis system. Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 218-223). Sunnyvale, CA. 13-15 September, 2016.