Intra Prediction Method for Depth Picture Using CNN and Attention Mechanism

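  • 윤재혁 (Department of Computer Software Engineering, Dong-eui University) ;
  • 이동석 (AI Grand ICT Research Center, Dong-eui University) ;
  • 윤병주 (School of Electronics Engineering, Kyungpook National University) ;
  • 권순각 (Department of Computer Software Engineering, Dong-eui University)
  • Received : 2024.03.18
  • Reviewed : 2024.04.15
  • Published : 2024.04.30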


This paper proposes an intra prediction method for depth pictures that uses a CNN and an attention mechanism. The proposed method allows each pixel in the block to be predicted to select its own reference pixels from the reference area. CNN layers extract spatial features in the vertical and horizontal directions from the reference pixels in the top and left areas adjacent to the block, respectively. The two spatial features are merged along the feature dimension and along the spatial dimension to predict features for the prediction block and for the reference pixels, respectively. The attention mechanism then predicts the correlation between the prediction block and the reference pixels from these spatial features. The predicted correlations are restored to the pixel domain through CNN layers to predict the pixel values in the block. When the proposed method is added to the VVC intra modes, the intra prediction error is reduced by 5.8% on average.
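
The idea that each pixel in the block selects its own reference pixels can be illustrated with plain scaled dot-product attention. The sketch below is only a toy under stated assumptions, not the network described in this paper: the function names (`softmax`, `attention_predict`), the hand-made query and key vectors, and the direct weighted sum of reference depth values are all hypothetical. In the proposed method the features come from CNN layers, and the attention output is restored to the pixel domain by further CNN layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_predict(queries, keys, ref_values):
    """Predict one depth value per block pixel.

    queries    : one feature vector per pixel in the block (hypothetical)
    keys       : one feature vector per reference pixel (hypothetical)
    ref_values : depth value of each reference pixel
    Each prediction is an attention-weighted sum of the reference values.
    """
    dim = len(keys[0])
    predicted = []
    for q in queries:
        # Scaled dot-product score between this block pixel and every reference pixel.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)  # non-negative weights that sum to 1
        predicted.append(sum(w * v for w, v in zip(weights, ref_values)))
    return predicted


# Toy data (assumed for illustration): four reference pixels with one-hot keys.
ref_values = [10.0, 200.0, 200.0, 200.0]
keys = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# First query strongly matches reference 0; second query has no preference.
queries = [
    [50.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
prediction = attention_predict(queries, keys, ref_values)
print(prediction)
```

In this toy setup, a query that aligns strongly with one key yields a prediction close to that reference pixel's depth, while a query with no preference falls back to the mean of all reference values. This is a rough analogue of per-pixel reference selection, with learned CNN features replaced by hand-made vectors.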



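This work was supported by the Regional Intelligence Innovation Talent Development Program (IITP-2024-2020-0-01791, 100%), carried out with the support of the Institute of Information & Communications Technology Planning & Evaluation (IITP) and funded by the Korean government (Ministry of Science and ICT), and by the BB21plus project of Busan Metropolitan City and Busan Techno Park.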

