Deep Neural Network-Based Scene Graph Generation for 3D Simulated Indoor Environments

  • Received : 2018.12.27
  • Accepted : 2019.03.09
  • Published : 2019.05.31

Abstract

A scene graph is a knowledge graph that represents the objects found in an image together with the relationships between them. This paper proposes a 3D scene graph generation model for three-dimensional indoor environments. A 3D scene graph includes not only object types, positions, and attributes, but also the three-dimensional spatial relationships between objects. A 3D scene graph can therefore be viewed as a prior knowledge base describing the environment in which an agent will later be deployed. Such 3D scene graphs are useful in many applications, including visual question answering (VQA) and service robots. The proposed model consists of four sub-networks: an object detection network (ObjNet), an attribute prediction network (AttNet), a transfer network (TransNet), and a relationship prediction network (RelNet). Experiments with the 3D simulated indoor environments provided by AI2-THOR confirmed that the proposed model achieves high performance.
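
The contents of a 3D scene graph listed above (object types, positions, attributes, and spatial relationships) can be illustrated with a small data structure. The following is only a minimal sketch under stated assumptions, not the representation used in the paper: the class names `ObjectNode` and `SceneGraph3D`, the field names, and the example labels ("Table", "Mug", "on") are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    """One object in the scene (hypothetical layout, not from the paper)."""
    obj_id: int
    obj_type: str                          # e.g. a detected object class
    position: Tuple[float, float, float]   # 3D position in world coordinates
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class SceneGraph3D:
    """Objects as nodes, spatial relationships as labeled directed edges."""
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.obj_id] = node

    def add_relation(self, subj_id: int, predicate: str, obj_id: int) -> None:
        # Both endpoints must already exist as nodes.
        if subj_id not in self.nodes or obj_id not in self.nodes:
            raise KeyError("both objects must be added before relating them")
        self.relations.append((subj_id, predicate, obj_id))

    def triples(self) -> List[Tuple[str, str, str]]:
        # Readable (subject type, predicate, object type) triples.
        return [
            (self.nodes[s].obj_type, p, self.nodes[o].obj_type)
            for s, p, o in self.relations
        ]


# Example with made-up objects: a mug resting on a table.
graph = SceneGraph3D()
graph.add_object(ObjectNode(0, "Table", (1.0, 0.0, 2.0), {"color": "brown"}))
graph.add_object(ObjectNode(1, "Mug", (1.0, 0.8, 2.0), {"color": "white"}))
graph.add_relation(1, "on", 0)
```

Storing relationships as (subject, predicate, object) triples keeps the graph easy to query, which is one plausible reason such a structure could serve as a knowledge base for tasks like VQA.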

Fig. 1. An Example of a Scene Graph

Fig. 2. 3D Scene Graph Generation

Fig. 3. 3D Scene Graph Generation Model

Fig. 4. Attribute Prediction Network (AttNet)

Fig. 5. Transfer Network (TransNet)

Fig. 6. Storing Object Information in Object Memory

Fig. 7. 3D Intersection over Union (3D IoU)

Fig. 8. Relationship Recognition Network (RelNet)

Fig. 9. 3D Scene Graphs Generated by the Proposed Model

Table 1. Performance Analysis of AttNet

Table 2. Performance Analysis of TransNet

Table 3. Performance Analysis of RelNet

Table 4. Performance Analysis of Total Model
