http://dx.doi.org/10.3745/KTSDE.2018.7.2.69

ORMN: A Deep Neural Network Model for Referring Expression Comprehension  

Shin, Donghyeop (Department of Computer Science, Kyonggi University)
Kim, Incheol (Department of Computer Science, Kyonggi University)
Publication Information
KIPS Transactions on Software and Data Engineering / v.7, no.2, 2018, pp. 69-76
Abstract
Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a new deep neural network model for referring expression comprehension. The proposed model locates the region of the referred object in a given image by using three kinds of information mentioned in the referring expression: the referred object itself, the context object, and the relationship between the two. The model combines an object matching score and a relationship matching score, according to the structure of the referring expression, to compute a fitness score for each candidate region. Accordingly, the model consists of four sub-networks: a Language Representation Network (LRN), an Object Matching Network (OMN), a Relationship Matching Network (RMN), and a Weighted Composition Network (WCN). We demonstrate that our model achieves state-of-the-art comprehension results on three referring expression datasets.
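The score combination described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption: the function names, the use of a weighted sum, the choice of the best-scoring context region, and the fixed weights are hypothetical stand-ins for what the OMN, RMN, and WCN would compute.

```python
from typing import List, Sequence


def fitness_scores(
    object_scores: Sequence[float],
    relation_scores: Sequence[Sequence[float]],
    w_object: float,
    w_relation: float,
) -> List[float]:
    """Combine per-region object scores with the best relationship score.

    object_scores[i]      : how well region i matches the referred object
    relation_scores[i][j] : how well the pair (region i, context region j)
                            matches the relationship in the expression
    w_object, w_relation  : composition weights (hypothetical; fixed here,
                            whereas a learned model would predict them)
    """
    fitness = []
    for i, obj in enumerate(object_scores):
        # Pick the best context region for candidate i, excluding i itself.
        candidates = [s for j, s in enumerate(relation_scores[i]) if j != i]
        best_rel = max(candidates) if candidates else 0.0
        # Assumed composition rule: a simple weighted sum of the two scores.
        fitness.append(w_object * obj + w_relation * best_rel)
    return fitness


def comprehend(
    object_scores: Sequence[float],
    relation_scores: Sequence[Sequence[float]],
    w_object: float,
    w_relation: float,
) -> int:
    """Return the index of the candidate region with the highest fitness."""
    scores = fitness_scores(object_scores, relation_scores, w_object, w_relation)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Three candidate regions with made-up scores, for illustration only.
    obj = [0.6, 0.5, 0.1]
    rel = [
        [0.0, 0.1, 0.2],
        [0.9, 0.0, 0.3],
        [0.2, 0.1, 0.0],
    ]
    print(comprehend(obj, rel, w_object=0.5, w_relation=0.5))
    print(comprehend(obj, rel, w_object=1.0, w_relation=0.0))
```

In this toy input, region 0 has the highest object score on its own, but region 1 is selected once the relationship score is given equal weight. This is the kind of case where context information changes the answer.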
Keywords
Referring Expression Comprehension; Deep Learning; Contextual Information; Weighted Composition