Trends in Video Visual Relationship Understanding

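  • Y.J. Kwon (권용진, Visual Intelligence Research Section) ;
  • D.H. Kim (김대회, Visual Intelligence Research Section) ;
  • J.H. Kim (김종희, Visual Intelligence Research Section) ;
  • S.C. Oh (오성찬, Visual Intelligence Research Section) ;
  • J.S. Ham (함제석, Visual Intelligence Research Section) ;
  • J.Y. Moon (문진영, Visual Intelligence Research Section)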
  • Published : 2023.12.01

Abstract

Visual relationship understanding in computer vision recognizes meaningful relationships between objects in a scene, which enables the extraction of representative information from visual content. We discuss visual relationship understanding with a focus on videos. We first introduce the concepts of visual relationship understanding in videos and then review recent techniques. Next, we present benchmark datasets commonly used in video visual relationship understanding. Finally, we discuss future research directions in this area.

Acknowledgement

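This research was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) with funding from the Ministry of Science and ICT [No. 2020-0-00004, Development of Core Technology for Predictive Visual Intelligence Based on a Long-Term Visual Memory Network].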
