Style Synthesis of Speech Videos Through Generative Adversarial Neural Networks

Choi, Hee Jo; Park, Goo Man

doi:10.3745/KTSDE.2022.11.11.465

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 11, Issue 11 / Pages 465-472 / 2022 / 2287-5905 (pISSN) / 2734-0503 (eISSN)

Korea Information Processing Society (한국정보처리학회)

A Study on Facial Video Style Synthesis Through Generative Adversarial Neural Networks

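Hee Jo Choi (Department of IT Media Engineering, Seoul National University of Science and Technology);
Goo Man Park (Department of IT Electronic Media Engineering, Seoul National University of Science and Technology)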

Received : 2021.12.29
Accepted : 2022.05.02
Published : 2022.11.30

https://doi.org/10.3745/KTSDE.2022.11.11.465

Abstract

This study combines a style synthesis network with an existing video synthesis network to overcome the limitations of style synthesis for video. The proposed network is trained to generate style-synthesized video by pairing style synthesis, learned with StyleGAN, with a video synthesis network. Because a subject's gaze and facial expression are difficult to transfer stably, 3D face reconstruction is applied, and the resulting 3D face information is used to control key features such as head pose, gaze, and expression. In addition, the discriminators of the Head2Head++ network for dynamics, mouth shape, image, and gaze are each trained separately, which makes it possible to generate stable style-synthesized video that better maintains plausibility and consistency. Experiments with the FaceForensics and MetFaces datasets confirmed improved performance: one video is converted into another while the movement of the target face stays consistent, and natural-looking data is generated through video synthesis that uses 3D face information from the face in the source video.

Acknowledgement

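This research was supported in 2021 by the Institute of Information & Communications Technology Planning & Evaluation (IITP), funded by the Korean government (Ministry of Science and ICT) (No.2017-0-00217, Research on immersive signage technology with variable transparency and layers).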

References

T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June, 2019. https://doi.org/10.1109/CVPR.2019.00453.
T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020. https://doi.org/10.1109/CVPR42600.2020.00813.
M. J. Chong, "GANs N' roses: Stable, controllable, diverse image to image translation," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021.
L. Tran and X. Liu, "On learning 3D face morphable model from in-the-wild images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.43, No.1, pp.157-171, 2021. https://doi.org/10.1109/TPAMI.2019.2927975.
T. C. Wang et al., "Video-to-video synthesis," Advances in Neural Information Processing Systems, 2018-December, 2018.
A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Niessner, "FaceForensics++: Learning to detect manipulated facial images," Proceedings of the IEEE International Conference on Computer Vision, 2019-October, 2019. https://doi.org/10.1109/ICCV.2019.00009.
M. R. Koujan, M. C. Doukas, A. Roussos, and S. Zafeiriou, "Head2Head: Video-based neural head synthesis," Proceedings - 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020, 2020. https://doi.org/10.1109/FG47880.2020.00048.
M. C. Doukas, M. R. Koujan, V. Sharmanska, A. Roussos, and S. Zafeiriou, "Head2Head++: Deep facial attributes re-targeting," arXiv preprint arXiv:2006.10199, 2020.
MetFace dataset [Internet], https://github.com/NVlabs/metfaces-dataset.
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018. https://doi.org/10.1109/CVPR.2018.00068.
D. Bank, N. Koenigstein, and R. Giryes, "Autoencoders," arXiv preprint arXiv:2003.05991, 2020.
D. P. Kingma and M. Welling, "Auto-encoding variational bayes," 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, 2014.
I. J. Goodfellow et al., "Generative adversarial nets," Advances in Neural Information Processing Systems, 3(Jan.), 2014. https://doi.org/10.3156/jsoft.29.5_177_2.
A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional GANs," International Conference on Learning Representations, 2016.
X. Huang and S. Belongie, "Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization," Proceedings of the IEEE International Conference on Computer Vision, 2017. https://doi.org/10.1109/ICCV.2017.167.
X. Han, L. Zhang, K. Zhou, and X. Wang, "ProGAN: Protein solubility generative adversarial nets for data augmentation in DNN framework," Computers and Chemical Engineering, Vol.131, 2019. https://doi.org/10.1016/j.compchemeng.2019.106533.
M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, "Stochastic variational inference," Journal of Machine Learning Research, Vol.14, 2013. https://doi.org/10.1184/R1/6475463.V1.
N.-A. Lahonce, Flickr-Faces-HQ Dataset (FFHQ), Nvidia, 2020.
Z. Zhang and M. R. Sabuncu, "Generalized cross entropy loss for training deep neural networks with noisy labels," Advances in Neural Information Processing Systems, Vol.31, 2018.
D. A. Pisner and D. M. Schnyer, "Support vector machine," In Machine Learning: Methods and Applications to Brain Disorders, 2019. https://doi.org/10.1016/B978-0-12-815739-8.00006-7.
A. Mathiasen and F. Hvilshoj, "Fast frechet inception distance," arXiv preprint arXiv:2009.14075, 2020.
O. Nizan and A. Tal, "Breaking the cycle-colleagues are all you need," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020. https://doi.org/10.1109/CVPR42600.2020.00788.
P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, "Pix2Pix," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint arXiv:1703.10593, 2017.
J. Han, J. Tao, and C. Wang, "FlowNet: A deep learning framework for clustering and selection of streamlines and stream surfaces," IEEE Transactions on Visualization and Computer Graphics, Vol.26, No.4, pp.1732-1744, 2020. https://doi.org/10.1109/TVCG.2018.2880207.
E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017. https://doi.org/10.1109/CVPR.2017.179.
L. Yuan, C. Ruan, H. Hu, and D. Chen, "Image inpainting based on Patch-GANs," IEEE Access, Vol.7, pp.46411-46421, 2019. https://doi.org/10.1109/ACCESS.2019.2909553.
X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "LSGAN," Proceedings of the IEEE International Conference on Computer Vision, 2017.
K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "MTCNN," IEEE Signal Processing Letters, Vol.23, No.10, 2016.
H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, "Normalized object coordinate space for category-level 6D object pose and size estimation," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019. https://doi.org/10.1109/CVPR.2019.00275.
B. J. B. Rani and L. M. E. Sumathi, "Survey on applying GAN for anomaly detection," 2020 International Conference on Computer Communication and Informatics (ICCCI), 2020. https://doi.org/10.1109/ICCCI48352.2020.9104046.
J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9906 LNCS, 2016. https://doi.org/10.1007/978-3-319-46475-6_43.
H. Y. Lee et al., "DRIT++: Diverse Image-to-Image Translation via Disentangled Representations," arXiv preprint arXiv:1905.01270, 2019.
E. Harkonen, A. Hertzmann, J. Lehtinen, and S. Paris, "GANSpace: Discovering interpretable GAN controls," Advances in Neural Information Processing Systems, Vol.33, pp.9841-9850, 2020.
X. Zhu, X. Liu, Z. Lei, and S. Z. Li, "Face alignment in full pose range: A 3D total solution," arXiv preprint arXiv:1804.01005, 2018.
J. H. Lee, M. J. Sung, J. W. Kang, and D. Chen, "Learning dense representations of phrases at scale," arXiv preprint arXiv:2012.12624, 2020.