Multi-View 3D Human Pose Estimation Based on Transformer

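  • 최승욱 (Department of Convergence Software, Soongsil University) ;
  • 이진영 (Department of Software, Soongsil University) ;
  • 김계영 (School of Software, Soongsil University)
  • Received : 2023.08.14
  • Reviewed : 2023.11.09
  • Published : 2023.12.29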

Abstract

3D human pose estimation is widely used in fields such as sports, action recognition, and visual effects for video media. Among the available approaches, multi-view 3D human pose estimation is essential for accurate estimation in complex real-world environments. However, existing multi-view models rely on 3D feature maps and therefore have high time complexity. This paper proposes a method that extends an existing Transformer-based monocular multi-frame model, which has low computational complexity, to multi-view 3D human pose estimation. To extend the model to multiple views, the method first uses the 2D human pose detector CPN (Cascaded Pyramid Network) to obtain the 2D coordinates of 17 joints in each of 4 views and concatenates them into 8-dimensional joint coordinates. These are then patch-embedded, converted into 17×32 data, and fed to the Transformer model. Finally, an MLP (Multi-Layer Perceptron) block that outputs the human pose is applied at every iteration, so that the 3D pose estimates for all 4 views are updated simultaneously. Compared with the method of Zheng [5] using an input length of 27 frames, the proposed method reduces the number of model parameters by 48.9% and the MPJPE (Mean Per Joint Position Error) by 20.6 mm (43.8%), and its average training time per epoch is more than 20 times faster.
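The input construction and the evaluation metric described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the view-major ordering of the 8 concatenated values, the use of a single linear layer as the embedding, and the random weights are all assumptions. Only the sizes (4 views, 17 joints, 32 embedding channels) and the definition of MPJPE as a mean per-joint Euclidean distance come from the text.

```python
import numpy as np

# Sizes stated in the abstract; everything else below is hypothetical.
NUM_VIEWS, NUM_JOINTS, EMBED_DIM = 4, 17, 32


def concat_views(keypoints_2d: np.ndarray) -> np.ndarray:
    """(views, joints, 2) -> (joints, views * 2): one 8-D vector per joint."""
    views, joints, coords = keypoints_2d.shape
    # Move the joint axis first, then flatten (view, xy) into one feature axis.
    return keypoints_2d.transpose(1, 0, 2).reshape(joints, views * coords)


def embed_joints(joint_features: np.ndarray, weight: np.ndarray,
                 bias: np.ndarray) -> np.ndarray:
    """Assumed linear embedding: (joints, 8) @ (8, 32) + (32,) -> (joints, 32)."""
    return joint_features @ weight + bias


def mpjpe(predicted: np.ndarray, target: np.ndarray) -> float:
    """Mean Euclidean distance over joints, in the units of the inputs."""
    return float(np.linalg.norm(predicted - target, axis=-1).mean())


rng = np.random.default_rng(0)
# Random stand-in for the 2D detections that CPN would provide.
keypoints_2d = rng.normal(size=(NUM_VIEWS, NUM_JOINTS, 2))
# Untrained placeholder parameters for the embedding layer.
weight = rng.normal(size=(NUM_VIEWS * 2, EMBED_DIM)) * 0.02
bias = np.zeros(EMBED_DIM)

joint_features = concat_views(keypoints_2d)          # shape (17, 8)
tokens = embed_joints(joint_features, weight, bias)  # shape (17, 32)
print(joint_features.shape, tokens.shape)
```

Treating each joint as one token keeps the sequence length at 17 regardless of the number of views, which is consistent with the abstract's emphasis on low computational cost. As a consistency check on the reported figures, a reduction of 20.6 mm that corresponds to 43.8% implies a baseline MPJPE of about 20.6 / 0.438 ≈ 47.0 mm and roughly 26.4 mm for the proposed method.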

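Funding Information

This research was supported by the Ministry of Science and ICT (MSIT) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Innovative Human Resource Development for Local Intellectualization program (IITP-2023-RS-2022-00156360).

References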

  1. Julieta Martinez, Rayat Hossain, Javier Romero, James J. Little, "A simple yet effective baseline for 3d human pose estimation," Proceedings of the IEEE International Conference on Computer Vision, pp. 2640-2649, 2017.
  2. Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, Dimitris N. Metaxas, "Semantic graph convolutional networks for 3d human pose regression," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3425-3435, 2019.
  3. Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli, "3d human pose estimation in video with temporal convolutions and semi-supervised training," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7753-7762, 2019.
  4. Ailing Zeng, Xiao Sun, Lei Yang, Nanxuan Zhao, Minhao Liu, Qiang Xu, "Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation," Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11436-11445, 2021.
  5. Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, Zhengming Ding, "3d human pose estimation with spatial and temporal transformers," Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11656-11665, 2021.
  6. Yihui He, Rui Yan, Katerina Fragkiadaki, Shoou-I Yu, "Epipolar transformers," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7779-7788, 2020.
  7. Karim Iskakov, Egor Burkov, Victor Lempitsky, Yury Malkov, "Learnable triangulation of human pose," Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7718-7727, 2019.
  8. Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, Jian Sun, "Cascaded pyramid network for multi-person pose estimation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103-7112, 2018.
  9. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint, arXiv:2010.11929, 2020.
  10. Fuyang Huang, Ailing Zeng, Minhao Liu, Qiuxia Lai, Qiang Xu, "Deepfuse: An imu-aware network for real-time 3d human pose estimation from multi-view image," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 429-438, 2020.
  11. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
  12. Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, Wenjun Zhang, "Deep kinematics analysis for monocular 3d human pose estimation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 899-908, 2020.
  13. Jingbo Wang, Sijie Yan, Yuanjun Xiong, Dahua Lin, "Motion guided 3d pose estimation from videos," European Conference on Computer Vision. Cham: Springer International Publishing, pp. 764-780, 2020.
  14. Hui Shuai, Lele Wu, Qingshan Liu, "Adaptive multi-view and temporal fusing transformer for 3d human pose estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4122-4135, 2022.
  15. Catalin Ionescu, Dragos Papava, Vlad Olaru, Cristian Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325-1339, 2013.