ORMN: A Deep Neural Network Model for Referring Expression Comprehension

Shin, Donghyeop; Kim, Incheol

doi:10.3745/KTSDE.2018.7.2.69

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 7, Issue 2 / Pages 69-76 / 2018 / pISSN 2287-5905 / eISSN 2734-0503

Korea Information Processing Society (한국정보처리학회)

Donghyeop Shin (Department of Computer Science, Kyonggi University);
Incheol Kim (Department of Computer Science, Kyonggi University)

Received : 2017.12.11
Accepted : 2018.01.01
Published : 2018.02.28

https://doi.org/10.3745/KTSDE.2018.7.2.69

Abstract

Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a new deep neural network model for referring expression comprehension. Given an image, the proposed model locates the region of the referred object by exploiting three kinds of information mentioned in the referring expression: the referred object itself, the context object, and the relationship between the two. The model computes a fitness score for each candidate region by combining an object matching score and a relationship matching score according to the structure of the referring expression. Accordingly, the model consists of four sub-networks: the Language Representation Network (LRN), the Object Matching Network (OMN), the Relationship Matching Network (RMN), and the Weighted Composition Network (WCN). We demonstrate that our model achieves state-of-the-art comprehension results on three referring expression datasets.

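The abstract does not state the exact rule by which the matching scores are combined, so the following Python sketch is only an illustration under stated assumptions, not the authors' WCN. It assumes that each candidate region has a score for matching the referred object, that each region also has a score for matching the context object, that each ordered pair of regions has a relationship score, and that three weights (here a hypothetical tuple `weights`) are supplied from the language side. The function names `fitness_scores` and `select_region`, and the choice of a weighted sum with a maximum over context regions, are assumptions made for this example.

```python
from typing import List, Sequence


def fitness_scores(
    subject_scores: Sequence[float],
    context_scores: Sequence[float],
    relation_scores: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Combine matching scores into one fitness score per candidate region.

    All inputs are hypothetical stand-ins for sub-network outputs:
      subject_scores[i]     : how well region i matches the referred object
      context_scores[j]     : how well region j matches the context object
      relation_scores[i][j] : how well the pair (i, j) matches the relationship
      weights               : (w_subj, w_rel, w_ctx), assumed to be derived
                              from the referring expression
    """
    w_subj, w_rel, w_ctx = weights
    n = len(subject_scores)
    scores = []
    for i in range(n):
        # Score every other region j as a possible context object for i.
        pair_terms = [
            w_rel * relation_scores[i][j] + w_ctx * context_scores[j]
            for j in range(n)
            if j != i
        ]
        # Keep the best-supporting context region (assumed rule); 0 if none.
        best_pair = max(pair_terms) if pair_terms else 0.0
        scores.append(w_subj * subject_scores[i] + best_pair)
    return scores


def select_region(scores: Sequence[float]) -> int:
    """Return the index of the candidate region with the highest fitness."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with three candidate regions (made-up numbers).
subject = [0.9, 0.8, 0.1]
context = [0.1, 0.2, 0.9]
relation = [
    [0.0, 0.1, 0.2],
    [0.1, 0.0, 0.9],
    [0.2, 0.3, 0.0],
]
scores = fitness_scores(subject, context, relation, weights=(0.5, 0.3, 0.2))
best = select_region(scores)
print(best, scores)
```

In this toy input, region 0 matches the referred object slightly better than region 1 on its own, but region 1 has a much stronger relationship with the best context region, so the combined score favors region 1. This is the kind of disambiguation that the context object and relationship information are meant to provide.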
