Acknowledgement
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education in 2023 (No. NRF-2021R1F1A1049202).